Random Reboot on Proxmox VE 8.1

Jan 23, 2023
Hi,

I've observed a seemingly random reboot of one of our main Proxmox nodes. I couldn't find a reason for the reboot in the logs.

The node was set up 4 days ago with a clean install and previously ran Proxmox VE 7.4 without any issues, with an uptime of multiple months between reboots for updates. I have attached the syslog output and a pvereport, I'd be happy if someone could take a look.

Best regards
 


Hello,

Thank you for the information!

From the syslog you've provided, I can see that Corosync lost its connection:
Code:
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: host: 5 link: 0 is down
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 has no active links
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] rx: host: 3 link: 0 is up
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] rx: host: 1 link: 0 is up
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:14 pve-main2 pvescheduler[512253]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 14 10:58:14 pve-main2 pvescheduler[512254]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[1]: 4
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (4.4467) was formed. Members
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[3]: 2 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync joined[2]: 2 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (2.446b) was formed. Members joined: 2 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[5]: 1 2 3 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync joined[4]: 1 2 3 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (1.446f) was formed. Members joined: 1 2 3 left: 2
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] Failed to receive the leave message. failed: 2
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] Retransmit List: 1
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: members: 1/3226, 2/1614, 3/1492, 4/4235, 5/1237
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: starting data syncronisation
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] This node is within the primary component and will provide service.
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Members[5]: 1 2 3 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: node has quorum
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: members: 1/3226, 2/1614, 3/1492, 4/4235, 5/1237
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: starting data syncronisation
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000051)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000052)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: received sync request (epoch 1/3226/0000003A)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000053)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: received sync request (epoch 1/3226/0000003B)

Looking at the `corosync.conf` and the network configuration, I see you have only a single vmbr0 shared by the VMs and Corosync. We recommend a dedicated NIC for the Corosync traffic, plus a second link for redundancy. In this case, I would add a ring_1 to the Corosync config, as described in our documentation [0].

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
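For illustration, a second ring is added by giving each node a `ring1_addr` entry in `/etc/pve/corosync.conf` (the addresses below are placeholders, assuming a separate 10.10.10.0/24 network is used for the new link); remember to also increment `config_version` in the `totem` section so the change propagates:
Code:
nodelist {
  node {
    name: pve-main2
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.0.2.14   # existing link (placeholder address)
    ring1_addr: 10.10.10.14  # new dedicated Corosync link (placeholder address)
  }
  # ... every other node gets a ring1_addr on the same dedicated network
}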
 
Hi Moayad,

Thanks for your reply. I will change the configuration to use a dedicated interface for Corosync. The connection loss might have been caused by some network configuration changes I was making, which resulted in a short interruption.

That shouldn't trigger a reboot though, should it?

Best regards
 
Hi,

Corosync does not need much bandwidth, but it does need low latency. Keeping the other link as a fallback is still good, in case the dedicated Corosync link has an issue.

To answer your question: yes, it can. If HA is active on the node, losing quorum for longer than about a minute causes the watchdog to fence the node, which results in a hard reboot. Without active HA resources, a short Corosync outage alone would not reboot the node.
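As a quick sanity check after adding the second ring, the state of both links can be inspected on each node (a sketch; the exact output depends on your cluster, and the ping target below is a placeholder address on the new Corosync network):
Code:
corosync-cfgtool -s     # show local link status for each ring
corosync-cfgtool -n     # show knet link state and MTU towards each neighbour
ping -c 10 10.10.10.11  # spot-check latency on the dedicated network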