Proxmox Nodes are auto rebooting

kiranch97 · Jul 10, 2024

Hi Team,

We have an 8-node Proxmox production cluster running version 7.4-17/513c62be. Nodes are rebooting on their own, at random times and on random days.
Can you please advise how to find the root cause of this?

aaron · Jul 10, 2024

A few more details would be good.

HA enabled? How many Corosync links are configured? Do they share the physical network with other services?

kiranch97 · Jul 10, 2024

Hi @aaron
Thanks for your reply,

Yes, HA is enabled.

We have two Corosync links configured:

1. We have 4 uplinks on each server, 10G each
2. 3 bonds are created (with LACP)
2.1 Three bridge networks for MGMT, storage replication, and client traffic
3. Two Corosync links
3.1 The first Corosync link runs over the MGMT bridge network
3.2 The second Corosync link is created as a VLAN interface on the client bridge

esi_y · Jul 10, 2024

kiranch97 said:
Yes, HA is enabled.
We have two Corosync links are configured,

Would you mind attaching the output of:

journalctl -b -1 -u pveproxy -u pvedaemon -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux > attachment.txt

... from one of the nodes that rebooted, along with the same log for the same time period from the other nodes? You can use switches such as --since "2024-07-09" --until "2024-07-10 04:00" to select the period.

Or simply attach all logs of the same period from all nodes during which at least one of them rebooted.
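
To collect the same window from every node, the request above can be sketched as a small dry-run script. This is only an illustration: the node list is a placeholder (only lagoon2 is named in this thread), root SSH access between nodes is assumed, and the default window is simply the example from the post, so it should be widened to cover the actual reboot time. Note that -b -1 is left out here because it selects the previous boot, which only makes sense on the node that rebooted.

```shell
#!/bin/sh
# Sketch: print one journalctl collection command per node (dry run only).
# NODES is a placeholder; set it to the real node names of the cluster.
NODES="${NODES:-lagoon2}"
SINCE="${SINCE:-2024-07-09}"
UNTIL="${UNTIL:-2024-07-10 04:00}"
# Same units as in the command quoted in this thread.
UNITS="pveproxy pvedaemon corosync pve-cluster pve-ha-crm pve-ha-lrm watchdog-mux"

journal_cmd() {
    # $1 = since, $2 = until; prints a single journalctl command line
    cmd="journalctl --no-pager --since \"$1\" --until \"$2\""
    for u in $UNITS; do
        cmd="$cmd -u $u"
    done
    printf '%s\n' "$cmd"
}

for node in $NODES; do
    # Nothing is executed here; copy the printed line to actually run it.
    printf "ssh root@%s '%s' > %s-journal.txt\n" \
        "$node" "$(journal_cmd "$SINCE" "$UNTIL")" "$node"
done
```

Running it with NODES="lagoon1 lagoon2" (hypothetical names) prints one ssh line per node, each writing to its own local file.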

sw-omit · Jul 10, 2024

Also, since you never mentioned it, just to confirm:
Do you have an external vote daemon installed/configured outside of Proxmox, since you're running a cluster with an even number of nodes?
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
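
The reason an even node count matters comes down to majority arithmetic. A rough sketch, assuming the default of one vote per node and that the external vote daemon contributes exactly one extra vote (neither has been confirmed for this cluster):

```shell
#!/bin/sh
# quorum_needed TOTAL_VOTES: votes required for a strict majority
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# tolerable_loss TOTAL_VOTES: votes that can go missing while staying quorate
tolerable_loss() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# 8 = eight nodes only; 9 = eight nodes plus one external vote (assumption)
for votes in 8 9; do
    echo "total votes: $votes, quorum: $(quorum_needed "$votes"), can lose: $(tolerable_loss "$votes")"
done
```

With 8 votes, quorum is 5, so a 4/4 network split leaves neither half quorate. With a ninth vote, one half of such a split can still reach 5. This does not show that a split caused the reboots here; it only explains why the question is worth asking on an HA-enabled cluster.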

kiranch97 · Jul 10, 2024

Jul 10 03:52:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:53:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:54:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:01 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:01 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:05 lagoon2 pvedaemon[4186835]: <root@pam> successful auth for user 'root@pam'
Jul 10 03:55:08 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:56:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:57:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:57:43 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:58:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:59:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:06 lagoon2 pvedaemon[4186835]: <root@pam> successful auth for user 'root@pam'
Jul 10 04:00:10 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:45 lagoon2 pvedaemon[454449]: <root@pam> successful auth for user 'root@pam'
Jul 10 04:00:45 lagoon2 pveproxy[1463004]: Clearing outdated entries from certificate cache
Jul 10 04:00:45 lagoon2 pveproxy[1451496]: Clearing outdated entries from certificate cache
Jul 10 04:00:45 lagoon2 pveproxy[1478979]: Clearing outdated entries from certificate cache
Jul 10 04:00:56 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:01:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:01:50 lagoon2 corosync[3216]: [TOTEM ] Retransmit List: 1a2578
Jul 10 04:02:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:03:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:04:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:06 lagoon2 pvedaemon[454449]: <root@pam> successful auth for user 'root@pam'
Jul 10 04:05:09 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:06:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:07:24 lagoon2 pmxcfs[3039]: [status] notice: received log
root@lagoon2:~#

Incident triggered at 4:07
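
In an excerpt like the one above, most lines are routine pmxcfs notices, and the single corosync line is easy to miss. A minimal filter sketch, assuming the process names in the journal match the unit names used elsewhere in this thread (corosync, pve-ha-crm, pve-ha-lrm, watchdog-mux):

```shell
#!/bin/sh
# filter_cluster_lines: read journal text on stdin and keep only lines
# logged by corosync, the HA services, or watchdog-mux.
filter_cluster_lines() {
    grep -E ' (corosync|pve-ha-crm|pve-ha-lrm|watchdog-mux)\[[0-9]+\]: '
}

# Usage (attachment.txt is the file name used earlier in the thread):
#   filter_cluster_lines < attachment.txt
```

This is only a reading aid. The unfiltered log is still what should be attached, since the relevant messages may come from other services or the kernel.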

esi_y · Jul 10, 2024

kiranch97 said:
Jul 10 04:07:24 lagoon2 pmxcfs[3039]: [status] notice: received log
root@lagoon2:~#

Incident triggered at 4:07

Can you share the whole end of the rebooted node's log (journalctl -b -1, without the -u switches)? Also, it would really help to see the logs of the selected services from all of the other nodes for several hours prior to the incident. There's a button below the text area to attach a file for your convenience. Feel free to sanitize, but with the -u switches already applied the output should not contain anything sensitive.
