Proxmox Nodes are auto rebooting

kiranch97

Hi Team,

We have an 8-node Proxmox production cluster running version 7.4-17/513c62be. Nodes are rebooting on their own at random times, on random days.
Could you please advise how to find the root cause?
 
A few more details would be good.

HA enabled? How many Corosync links are configured? Do they share the physical network with other services?
 
Hi @aaron
Thanks for your reply,

Yes, HA is enabled.

We have two Corosync links configured:

1. Each server has 4 uplinks, 10G each.
2. 3 bonds are created (with LACP).
2.1 Three bridge networks for MGMT, storage replication, and client traffic.
3. Corosync links:
3.1 The first Corosync link runs over the MGMT bridge network.
3.2 The second Corosync link uses VLAN interfaces on the client bridge.
 
Would you mind attaching the output of the following:

journalctl -b -1 -u pveproxy -u pvedaemon -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux > attachment.txt

... from one of the nodes that rebooted, along with the same log for the same time period from the other nodes? You can limit the period with e.g. the --since "2024-07-09" --until "2024-07-10 04:00" switches.

Or simply attach the logs from all nodes for a period during which at least one of them rebooted.
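
In case it helps, here is a rough sketch of how the same time window could be collected from every node in one go. It assumes root SSH access between the nodes, and the node names in the usage line (other than lagoon2) and the output file names are only placeholders. The unit list is the one from the command above; -b -1 is left out on purpose so that the time window alone selects the entries.

```shell
#!/bin/sh
# Units taken from the journalctl command earlier in this post.
UNITS="pveproxy pvedaemon corosync pve-cluster pve-ha-crm pve-ha-lrm watchdog-mux"

# Print the journalctl command line for a given time window.
journal_cmd() {
    start=$1
    end=$2
    cmd="journalctl --since \"$start\" --until \"$end\""
    for unit in $UNITS; do
        cmd="$cmd -u $unit"
    done
    printf '%s\n' "$cmd"
}

# Run that command on each node over SSH (assumption: root SSH works
# between nodes) and save one file per node in the current directory.
collect_logs() {
    start=$1
    end=$2
    shift 2
    for node in "$@"; do
        ssh "root@$node" "$(journal_cmd "$start" "$end")" > "journal-$node.txt"
    done
}

# Usage (hypothetical node names, adjust to your cluster):
# collect_logs "2024-07-09" "2024-07-10 04:00" lagoon1 lagoon2 lagoon3
```

Each node then produces a journal-NODE.txt file that can be attached as-is.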
 
Jul 10 03:52:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:53:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:54:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:01 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:01 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:05 lagoon2 pvedaemon[4186835]: <root@pam> successful auth for user 'root@pam'
Jul 10 03:55:08 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:55:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:56:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:57:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:57:43 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:58:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 03:59:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:06 lagoon2 pvedaemon[4186835]: <root@pam> successful auth for user 'root@pam'
Jul 10 04:00:10 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:00:45 lagoon2 pvedaemon[454449]: <root@pam> successful auth for user 'root@pam'
Jul 10 04:00:45 lagoon2 pveproxy[1463004]: Clearing outdated entries from certificate cache
Jul 10 04:00:45 lagoon2 pveproxy[1451496]: Clearing outdated entries from certificate cache
Jul 10 04:00:45 lagoon2 pveproxy[1478979]: Clearing outdated entries from certificate cache
Jul 10 04:00:56 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:01:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:01:50 lagoon2 corosync[3216]: [TOTEM ] Retransmit List: 1a2578
Jul 10 04:02:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:03:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:04:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:02 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:06 lagoon2 pvedaemon[454449]: <root@pam> successful auth for user 'root@pam'
Jul 10 04:05:09 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:05:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:06:24 lagoon2 pmxcfs[3039]: [status] notice: received log
Jul 10 04:07:24 lagoon2 pmxcfs[3039]: [status] notice: received log
root@lagoon2:~#

The incident occurred at 04:07.
 
Can you share the complete end of the rebooted node's log (journalctl -b -1 without the -u switches)? It would also really help to see the logs of the selected services from all the other nodes for several hours before the incident. There's a button below the text area to attach a file, for your convenience. Feel free to sanitize the logs, though with the -u switches applied they should not contain anything sensitive.
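
If you want a first look yourself before attaching, a small filter like the sketch below can hide the pmxcfs "received log" noise in a saved journal file and leave only the Corosync, HA and watchdog lines. The function name is made up for illustration, and the process names in the pattern are an assumption about how these daemons tag their journal lines (as in "corosync[3216]:" in the excerpt above).

```shell
#!/bin/sh
# Print only the lines logged by the Corosync, HA manager and watchdog
# daemons from a saved journal file (path given as the first argument).
cluster_events() {
    grep -E ' (corosync|pve-ha-lrm|pve-ha-crm|watchdog-mux)\[[0-9]+\]: ' "$1"
}

# Usage (file name is a placeholder):
# cluster_events attachment.txt
```

Run against the excerpt posted above, this would leave only the "Retransmit List" line from corosync, so the full unfiltered logs are still needed for a diagnosis.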
 
