All three of my Proxmox servers restart unexpectedly.

Nov 2, 2023
Good afternoon.

I would like to request your help with an issue we are facing. I have 3 Dell PowerEdge R720 physical servers running Proxmox 7.2-3, and in the last 4 weeks all three physical servers have been restarting unexpectedly.

These three servers are configured in a cluster and have high availability using CEPH.

What can I do to solve this situation?
 
Hi José,

Welcome to the forums!

What can I do to solve this situation?
My first reaction would be: troubleshoot!

What are the circumstances under which the servers reboot? Do all three restart at the same time, or does one reboot one day and another on a different day? Which causes have you already checked and excluded? How long had the servers been running without problems? Having three servers develop an instability at the same time is curious.

In my personal experience, unexpected reboots are mostly due to hardware faults or incompatibilities, power disturbances, kernel/driver issues with "exotic" hardware, or configuration choices. This is not Proxmox-specific, of course. More experienced users can probably add a few more ;-)

Let me give a possible example of each category:
  • Hardware faults and incompatibilities: I had some faulty RAM that would crash my system when load rose above a certain threshold; I suppose a critical process also had to be running in the faulty area. This is a bit far-fetched for three systems at once, but if load has historically been only around 50% and started rising lately, so that more of the RAM is actually being used, it could be a possibility. Did you run a memory check? (See the sketch at the end of this post.)
  • Power disturbances: I've had undervolted and overvolted systems misbehave in private settings, not in "professional IT" environments. If all three servers reboot at the same time, I'd keep an eye on voltage drops (or spikes).
  • Kernel/driver issues with "exotic" hardware: not really that exotic, but mostly hardware from vendors that give too few details about their product to have a reliably working open source driver. Regular use cases are often reverse-engineered, but edge cases can lead to a kernel panic.
  • Configuration choices: running with ample RAM and no swap works, as long as there is enough RAM to keep the OOM killer at bay. You have swap configured, I suppose?
These are just some examples. What information do the logs give you? With a cluster and high availability Ceph running, do you have monitoring and an external log server that might hold a clue?
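
For the RAM and swap points above, a minimal sketch of what I would run on each node; memtester and the 1 GiB size are only examples, and a proper test of all RAM is better done by booting memtest86+ from the boot menu during a maintenance window:

apt install memtester      # small userspace RAM tester from the Debian repositories
memtester 1024M 1          # test 1 GiB for one pass; adjust the size to the RAM you have free
free -h                    # how much RAM and swap is in use right now?
swapon --show              # is any swap configured at all?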
 
Hi wbk!

Thanks for answering me.

I found this in the logs of my three servers:

Nov 01 10:39:29 srv3-resonancia corosync[1437]: [TOTEM ] Retransmit List: 3 5
Nov 01 10:39:30 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 50
Nov 01 10:39:30 srv3-resonancia corosync[1437]: [TOTEM ] Retransmit List: 3
Nov 01 10:39:31 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 60
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [QUORUM] Sync members[3]: 1 2 3
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [QUORUM] Sync joined[1]: 1
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [QUORUM] Sync left[1]: 1
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [TOTEM ] A new membership (1.164e7) was formed. Members joined: 1 left: 1
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [TOTEM ] Failed to receive the leave message. failed: 1
Nov 01 10:39:32 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 70
Nov 01 10:39:33 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 80
Nov 01 10:39:33 srv3-resonancia corosync[1437]: [KNET ] rx: host: 2 link: 1 is up
Nov 01 10:39:33 srv3-resonancia corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
-- Reboot --


Nov 01 10:39:22 srv2-tecnologia pvescheduler[598497]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Nov 01 10:39:23 srv2-tecnologia corosync[1428]: [KNET ] link: host: 3 link: 1 is down
Nov 01 10:39:23 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:23 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 10
Nov 01 10:39:24 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 20
Nov 01 10:39:25 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 30
Nov 01 10:39:26 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 40
Nov 01 10:39:27 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 50
Nov 01 10:39:28 srv2-tecnologia corosync[1428]: [KNET ] rx: host: 3 link: 1 is up
Nov 01 10:39:28 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:28 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 60
Nov 01 10:39:28 srv2-tecnologia watchdog-mux[1092]: client watchdog expired - disable watchdog updates
Nov 01 10:39:29 srv2-tecnologia corosync[1428]: [TOTEM ] Token has not been received in 2737 ms
Nov 01 10:39:29 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 70
Nov 01 10:39:29 srv2-tecnologia corosync[1428]: [TOTEM ] Retransmit List: 3 4 5
Nov 01 10:39:30 srv2-tecnologia corosync[1428]: [TOTEM ] Retransmit List: 3 5
Nov 01 10:39:30 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 80
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [QUORUM] Sync members[3]: 1 2 3
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [QUORUM] Sync joined[1]: 1
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [QUORUM] Sync left[1]: 1
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [TOTEM ] A new membership (1.164e7) was formed. Members joined: 1 left: 1
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [TOTEM ] Failed to receive the leave message. failed: 1
Nov 01 10:39:31 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 90
Nov 01 10:39:32 srv2-tecnologia corosync[1428]: [KNET ] link: host: 3 link: 1 is down
Nov 01 10:39:32 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:32 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 100
Nov 01 10:39:32 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retried 100 times
Nov 01 10:39:32 srv2-tecnologia pmxcfs[1422]: [status] crit: cpg_send_message failed: 6
Nov 01 10:39:32 srv2-tecnologia pve-firewall[1488]: firewall update time (11.302 seconds)
Nov 01 10:39:33 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 10
Nov 01 10:39:34 srv2-tecnologia corosync[1428]: [KNET ] link: host: 3 link: 0 is down
Nov 01 10:39:34 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:34 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 has no active links
Nov 01 10:39:34 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 20
-- Reboot --


Nov 01 10:39:24 srv1-data-center pvescheduler[598831]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Nov 01 10:39:24 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 70
Nov 01 10:39:25 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 10
Nov 01 10:39:25 srv1-data-center corosync[1433]: [TOTEM ] Retransmit List: 9 a
Nov 01 10:39:25 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 80
Nov 01 10:39:26 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 20
Nov 01 10:39:26 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 90
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 30
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 100
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retried 100 times
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [status] crit: cpg_send_message failed: 6
Nov 01 10:39:27 srv1-data-center pve-firewall[1493]: firewall update time (11.308 seconds)
Nov 01 10:39:28 srv1-data-center corosync[1433]: [TOTEM ] Token has not been received in 2737 ms
Nov 01 10:39:28 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 40
Nov 01 10:39:28 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 10
Nov 01 10:39:28 srv1-data-center watchdog-mux[1097]: client watchdog expired - disable watchdog updates
Nov 01 10:39:29 srv1-data-center corosync[1433]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Nov 01 10:39:29 srv1-data-center corosync[1433]: [KNET ] link: host: 3 link: 1 is down
Nov 01 10:39:29 srv1-data-center corosync[1433]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:29 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 50
Nov 01 10:39:29 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 20
Nov 01 10:39:30 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 60
Nov 01 10:39:30 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 30
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync members[3]: 1 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync joined[2]: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync left[2]: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [TOTEM ] A new membership (1.164e7) was formed. Members joined: 2 3 left: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [KNET ] rx: host: 3 link: 1 is up
Nov 01 10:39:31 srv1-data-center corosync[1433]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:31 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 70
Nov 01 10:39:31 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 40
Nov 01 10:39:32 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 80
Nov 01 10:39:32 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 50
Nov 01 10:39:33 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 90
-- Reboot --

What can I do?
 
Hi José,

What I notice:
  • All three servers rebooted at the same moment, around Nov 01 10:39:33
  • There are many cpg_send_message retries, which does not seem the way it should be
  • "A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus."; more than 4 seconds is quite long. Are all nodes in the same building / city?
Which log is it?

Once I know, I can compare it to my own. At first glance, a log in which half the lines are "send message retry" and the rest mention "timeout", "processor failed", "link down" and "failed" would alarm me.

Even so, I think these logs only show symptoms of an underlying problem, not the cause itself.
 
Hi WBK.

Each server is in a different building, in the same city (they belong to a health institution).
All the information I sent was taken from the syslog view in the Proxmox web interface.
Is there a path where the Proxmox logs are in text mode?
 
Hi José,

Each server is in a different building... health...
That would make power disturbances extremely far-fetched.

Is there a path where the Proxmox logs are in text mode?
Yes, there is. Proxmox is built on Debian.

Do you have experience troubleshooting Linux systems? It seems there are network problems, but I would learn something new if that turned out to be causing the synchronous reboot.

Log in to the server via SSH, and have a look in /var/log for more logging.
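
As a starting point, a rough sketch of what I would look at over SSH (assuming the default Debian/Proxmox logging setup):

less /var/log/syslog       # the same messages as the GUI syslog view, as a plain text file
journalctl -b -1 -e        # end of the previous boot's journal, if persistent journaling is enabled
dmesg -T | tail -n 100     # recent kernel messages with readable timestamps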
 

Here are the /var/log/syslog files from the servers.
Please check out the attached files.
 
The servers rebooted because you use HA, and you lost quorum for too long.

Do you manage the network between the buildings? Personally, I would avoid using HA with a metro-cluster across different buildings, unless you have dedicated fibers for your Proxmox corosync network.

You could have a network problem or link saturation (you should really never have bandwidth saturation, or you'll have problems like this).
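
As a quick check (a rough sketch, the exact output depends on your setup), you can see the quorum and HA state on each node with:

pvecm status               # cluster membership and quorum as Proxmox sees it
corosync-quorumtool -s     # quorum state directly from corosync
ha-manager status          # HA-managed resources; with HA active, a node that loses quorum for too long is fenced (rebooted) by the watchdog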
 
Hi spirit,

Thanks for chiming in!

The servers rebooted because you use HA, and you lost quorum for too long.
I wondered about that, but could not find it in the docs (and did not know what to search for).

How did you recognize the problem: is it the retries on the messages, or the sync/membership changes like the snippet below?
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync members[3]: 1 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync joined[2]: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync left[2]: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [TOTEM ] A new membership (1.164e7) was formed. Members joined: 2 3 left: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [KNET ] rx: host: 3 link: 1 is up
Nov 01 10:39:31 srv1-data-center corosync[1433]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

Is the timeout configurable, or is it a fixed period?

network problems, but I would learn something new if that turned out to be causing the synchronous reboot.
At least I learned something new today! :D
 
The servers rebooted because you use HA, and you lost quorum for too long.

Do you manage the network between the buildings? Personally, I would avoid using HA with a metro-cluster across different buildings, unless you have dedicated fibers for your Proxmox corosync network.

You could have a network problem or link saturation (you should really never have bandwidth saturation, or you'll have problems like this).
Not the OP, but thanks for that. I would also like to find more details on this behaviour, as I was testing a 2-node cluster (I know, but it was just testing) and it was seemingly randomly rebooting (surprise surprise, every time I was doing some maintenance on one of the 2 nodes). Once I set pvecm expected 1 for that period, it stopped happening, so I figured the reboots of the "healthy" node were by design, but I wish it were documented somewhere.

Later on, however, when I had one of the nodes freeze, the other one just kept hanging there; of course it lost quorum, but it never rebooted. Is this described anywhere properly?
 
Hi José,

Sorry WBK, I'm not an expert on this kind of thing, as I'm just learning how to work with this.
No need to apologize, most of us here are learning :-) The best way to learn may be formal education, but doing is a good second. Asking someone to solve the problem without having a look yourself takes away the chance to learn.

Pay special attention to spirit's message: I think he solved the mystery of the booting servers, even without the logs from /var/log ;-)
 
The servers rebooted because you use HA, and you lost quorum for too long.

Do you manage the network between the buildings? Personally, I would avoid using HA with a metro-cluster across different buildings, unless you have dedicated fibers for your Proxmox corosync network.

You could have a network problem or link saturation (you should really never have bandwidth saturation, or you'll have problems like this).
I don't know if I understood well, since my English is not very good (I speak Spanish).
Currently the three buildings are connected by fiber optic cables.
 
Currently the three buildings are connected by fiber optic cables.
Is it a direct connection from one building to the other, or is it fiber optic via the Internet? I think @spirit means: a direct cable from one building to the other, only for Proxmox node communication (so not via the Internet, and not shared with other services).

What are the ping response times between buildings? Are they influenced by other traffic, either normal user actions, backup scripts, or people on the Internet watching a TV program all at the same time?
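
A rough sketch of how I would measure it, run from one node towards the other two, both at a quiet and at a busy moment (the addresses are placeholders for your corosync network):

ping -c 100 -i 0.2 -q 192.168.1.12   # 100 pings to the second node, summary only
ping -c 100 -i 0.2 -q 192.168.1.13   # same towards the third node; corosync is sensitive to latency spikes and jitter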
 
So the corosync messages are indicative of the following:
https://github.com/corosync/corosync/issues/622

Short version also here:
https://www.suse.com/support/kb/doc/?id=000020407

I would wonder whether there are any firewalls with IPS or IDS in place between the networks (or anything that was added or changed in the last 4 weeks, when you mention the problems started), or whether it is indeed a dedicated connection. Checking with just ICMP might not show any of that when it comes to latency or jitter.
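
For the latency and jitter part, something like omping (packaged for Debian/Proxmox) between the nodes gives a better picture than plain ping, since it sends UDP traffic much like corosync does. A sketch with the node names from your logs, to be started on all three nodes at roughly the same time:

omping -c 600 -i 1 -q srv1-data-center srv2-tecnologia srv3-resonancia   # roughly 10 minutes of probes; watch the reported latency and loss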

Do you mind sharing the output of # corosync-cmapctl | grep 'runtime.config.totem'

And the totem {} section of /etc/corosync/corosync.conf?

(You might want to stop using HA setup for the time being until you can solve the QoS of the network.)
 
Hi Esiy.

Here is the output:

corosync-cmapctl | grep 'runtime.config.totem'
runtime.config.totem.block_unlisted_ips (u32) = 1
runtime.config.totem.cancel_token_hold_on_retransmit (u32) = 0
runtime.config.totem.consensus (u32) = 4380
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 685
runtime.config.totem.interface.0.knet_ping_interval (u32) = 912
runtime.config.totem.interface.0.knet_ping_timeout (u32) = 1825
runtime.config.totem.interface.1.knet_ping_interval (u32) = 912
runtime.config.totem.interface.1.knet_ping_timeout (u32) = 1825
runtime.config.totem.join (u32) = 50
runtime.config.totem.knet_compression_level (i32) = 0
runtime.config.totem.knet_compression_model (str) = none
runtime.config.totem.knet_compression_threshold (u32) = 0
runtime.config.totem.knet_pmtud_interval (u32) = 30
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 3650
runtime.config.totem.token_retransmit (u32) = 869
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.token_warning (u32) = 75
runtime.config.totem.window_size (u32) = 50

And the totem section:

totem {
  cluster_name: CLUSTER-EMPRESA
  config_version: 3
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Is it a direct connection from one building to the other, or is it fiber optic via the Internet? I think @spirit means: a direct cable from one building to the other, only for Proxmox node communication (so not via the Internet, and not shared with other services).

What are the ping response times between buildings? Are they influenced by other traffic, either normal user actions, backup scripts, or people on the Internet watching a TV program all at the same time?
Hi.

I have a server in the central part of the company; it has two network cables that connect to Gigabit ports on a Cisco switch, and from there a fiber optic cable runs to a Cisco fiber optic switch.
The other two servers are in two different buildings, and their connections are set up identically to the first server's.

Note: the other day I noticed that the fiber optic switch had restarted for no apparent reason.
 
