Hi José,
Welcome to the forums!
My first reaction would be: troubleshoot!
What are the circumstances under which the servers reboot? Do all three restart at the same time, or does one reboot one day and another the next? Which causes have you already checked and excluded? How long have the servers been running without problems? Having three servers develop an instability at the same time is curious.
In my personal experience, unexpected reboots are mostly due to hardware faults or incompatibilities, power disturbances, kernel/driver issues with "exotic" hardware, or configuration choices. This is not Proxmox specific, of course. More experienced users can probably add a few ;-)
Let me give a possible example of each category:
- Hardware faults and incompatibilities: I had some faulty RAM that would crash my system whenever load rose above a certain threshold; I suppose a critical process also had to land in the faulty region. This is a bit far-fetched for three systems at once, but if load has historically sat around 50% and only recently started rising, so that previously unused RAM is now being touched, it could be a possibility. Did you run a memory check? (See the commands after this list.)
- Power disturbances: I've had under- and over-volted systems misbehave in private settings, though not in "professional IT" environments. If all three servers reboot at the same time, I'd keep an eye on voltage drops (or spikes).
- Kernel/driver issues with "exotic" hardware: not really that exotic, but mostly hardware from vendors that publish too few details about their products for a reliably working open-source driver to exist. Regular use cases are often covered by reverse engineering, but edge cases can lead to a kernel panic.
- Configuration choices: running with ample RAM and no swap works, as long as there is enough RAM to keep the OOM killer at bay. You do have swap configured, I suppose? (Also covered by the commands below.)
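For the RAM and swap points above, a rough starting point, assuming Debian-based PVE hosts (memtester is not installed by default, and an offline memtest86+ run from the boot menu is the more thorough option):

# current memory headroom and configured swap
free -h
swapon --show

# quick in-place RAM test: 2 GiB, 3 passes (better: boot memtest86+ and test all of it)
apt install memtester
memtester 2048M 3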
These are just some examples. What information do the logs give you? With a cluster and highly available Ceph running, do you have monitoring and an external log server that might hold a clue?
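If the nodes get fenced and reboot hard, the interesting lines are usually in the journal of the boot before the reboot. Assuming the persistent systemd journal is enabled, something like:

# logs of the previous boot, jump to the end
journalctl -b -1 -e

# current cluster membership and corosync link health
pvecm status
corosync-cfgtool -s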
Hi wbk!
Thanks for answering me.
I found this in the logs of my three servers:
Nov 01 10:39:29 srv3-resonancia corosync[1437]: [TOTEM ] Retransmit List: 3 5
Nov 01 10:39:30 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 50
Nov 01 10:39:30 srv3-resonancia corosync[1437]: [TOTEM ] Retransmit List: 3
Nov 01 10:39:31 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 60
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [QUORUM] Sync members[3]: 1 2 3
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [QUORUM] Sync joined[1]: 1
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [QUORUM] Sync left[1]: 1
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [TOTEM ] A new membership (1.164e7) was formed. Members joined: 1 left: 1
Nov 01 10:39:31 srv3-resonancia corosync[1437]: [TOTEM ] Failed to receive the leave message. failed: 1
Nov 01 10:39:32 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 70
Nov 01 10:39:33 srv3-resonancia pmxcfs[1430]: [status] notice: cpg_send_message retry 80
Nov 01 10:39:33 srv3-resonancia corosync[1437]: [KNET ] rx: host: 2 link: 1 is up
Nov 01 10:39:33 srv3-resonancia corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
-- Reboot --
Nov 01 10:39:22 srv2-tecnologia pvescheduler[598497]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Nov 01 10:39:23 srv2-tecnologia corosync[1428]: [KNET ] link: host: 3 link: 1 is down
Nov 01 10:39:23 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:23 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 10
Nov 01 10:39:24 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 20
Nov 01 10:39:25 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 30
Nov 01 10:39:26 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 40
Nov 01 10:39:27 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 50
Nov 01 10:39:28 srv2-tecnologia corosync[1428]: [KNET ] rx: host: 3 link: 1 is up
Nov 01 10:39:28 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:28 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 60
Nov 01 10:39:28 srv2-tecnologia watchdog-mux[1092]: client watchdog expired - disable watchdog updates
Nov 01 10:39:29 srv2-tecnologia corosync[1428]: [TOTEM ] Token has not been received in 2737 ms
Nov 01 10:39:29 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 70
Nov 01 10:39:29 srv2-tecnologia corosync[1428]: [TOTEM ] Retransmit List: 3 4 5
Nov 01 10:39:30 srv2-tecnologia corosync[1428]: [TOTEM ] Retransmit List: 3 5
Nov 01 10:39:30 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 80
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [QUORUM] Sync members[3]: 1 2 3
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [QUORUM] Sync joined[1]: 1
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [QUORUM] Sync left[1]: 1
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [TOTEM ] A new membership (1.164e7) was formed. Members joined: 1 left: 1
Nov 01 10:39:31 srv2-tecnologia corosync[1428]: [TOTEM ] Failed to receive the leave message. failed: 1
Nov 01 10:39:31 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 90
Nov 01 10:39:32 srv2-tecnologia corosync[1428]: [KNET ] link: host: 3 link: 1 is down
Nov 01 10:39:32 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:32 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 100
Nov 01 10:39:32 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retried 100 times
Nov 01 10:39:32 srv2-tecnologia pmxcfs[1422]: [status] crit: cpg_send_message failed: 6
Nov 01 10:39:32 srv2-tecnologia pve-firewall[1488]: firewall update time (11.302 seconds)
Nov 01 10:39:33 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 10
Nov 01 10:39:34 srv2-tecnologia corosync[1428]: [KNET ] link: host: 3 link: 0 is down
Nov 01 10:39:34 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:34 srv2-tecnologia corosync[1428]: [KNET ] host: host: 3 has no active links
Nov 01 10:39:34 srv2-tecnologia pmxcfs[1422]: [status] notice: cpg_send_message retry 20
-- Reboot --
Nov 01 10:39:24 srv1-data-center pvescheduler[598831]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Nov 01 10:39:24 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 70
Nov 01 10:39:25 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 10
Nov 01 10:39:25 srv1-data-center corosync[1433]: [TOTEM ] Retransmit List: 9 a
Nov 01 10:39:25 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 80
Nov 01 10:39:26 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 20
Nov 01 10:39:26 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 90
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 30
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 100
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retried 100 times
Nov 01 10:39:27 srv1-data-center pmxcfs[1365]: [status] crit: cpg_send_message failed: 6
Nov 01 10:39:27 srv1-data-center pve-firewall[1493]: firewall update time (11.308 seconds)
Nov 01 10:39:28 srv1-data-center corosync[1433]: [TOTEM ] Token has not been received in 2737 ms
Nov 01 10:39:28 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 40
Nov 01 10:39:28 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 10
Nov 01 10:39:28 srv1-data-center watchdog-mux[1097]: client watchdog expired - disable watchdog updates
Nov 01 10:39:29 srv1-data-center corosync[1433]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Nov 01 10:39:29 srv1-data-center corosync[1433]: [KNET ] link: host: 3 link: 1 is down
Nov 01 10:39:29 srv1-data-center corosync[1433]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:29 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 50
Nov 01 10:39:29 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 20
Nov 01 10:39:30 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 60
Nov 01 10:39:30 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 30
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync members[3]: 1 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync joined[2]: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [QUORUM] Sync left[2]: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [TOTEM ] A new membership (1.164e7) was formed. Members joined: 2 3 left: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Nov 01 10:39:31 srv1-data-center corosync[1433]: [KNET ] rx: host: 3 link: 1 is up
Nov 01 10:39:31 srv1-data-center corosync[1433]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 10:39:31 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 70
Nov 01 10:39:31 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 40
Nov 01 10:39:32 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 80
Nov 01 10:39:32 srv1-data-center pmxcfs[1365]: [status] notice: cpg_send_message retry 50
Nov 01 10:39:33 srv1-data-center pmxcfs[1365]: [dcdb] notice: cpg_send_message retry 90
-- Reboot --
What can I do?