Random Reboot on Proxmox VE 8.1

Jan 23, 2023
Hi,

I've observed a seemingly random reboot of one of our main Proxmox nodes. I couldn't find a reason for the reboot in the logs.

The node was set up 4 days ago with a clean install. Before that, it ran Proxmox VE 7.4 without any issues, with uptimes of multiple months between reboots for updates. I have attached the syslog output and a pvereport, and I'd be grateful if someone could take a look.

Best regards
 

Hello,

Thank you for the information!

From the syslog you've provided, I can see that Corosync lost its connection to the other nodes and the node temporarily lost quorum:
Code:
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: host: 5 link: 0 is down
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 has no active links
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] rx: host: 3 link: 0 is up
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] rx: host: 1 link: 0 is up
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:14 pve-main2 pvescheduler[512253]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 14 10:58:14 pve-main2 pvescheduler[512254]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[1]: 4
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (4.4467) was formed. Members
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[3]: 2 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync joined[2]: 2 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (2.446b) was formed. Members joined: 2 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[5]: 1 2 3 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync joined[4]: 1 2 3 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (1.446f) was formed. Members joined: 1 2 3 left: 2
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] Failed to receive the leave message. failed: 2
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] Retransmit List: 1
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: members: 1/3226, 2/1614, 3/1492, 4/4235, 5/1237
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: starting data syncronisation
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] This node is within the primary component and will provide service.
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Members[5]: 1 2 3 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: node has quorum
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: members: 1/3226, 2/1614, 3/1492, 4/4235, 5/1237
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: starting data syncronisation
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000051)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000052)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: received sync request (epoch 1/3226/0000003A)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000053)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: received sync request (epoch 1/3226/0000003B)

Looking at your `corosync.conf` and network configuration, I see that a single bridge, vmbr0, carries both the VM traffic and Corosync. We recommend a dedicated NIC for Corosync, plus a second link for redundancy. In this case, I would add a ring_1 (a `ring1_addr` entry for each node) to the Corosync config, as described in our documentation [0].

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
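
As a rough sketch of what a second link looks like in the `nodelist` section of `corosync.conf`: the node name and node ID below are taken from the log above, but both IP addresses and their subnets are placeholders I made up for illustration, not values from your pvereport. Every node in the cluster needs its own `ring1_addr`, and `config_version` in the `totem` section has to be increased for the change to be picked up.

```
# Illustrative fragment only; addresses are hypothetical placeholders.
nodelist {
  node {
    name: pve-main2
    nodeid: 4
    quorum_votes: 1
    # existing link (currently shared with VM traffic on vmbr0)
    ring0_addr: 192.0.2.14
    # new link on a dedicated NIC / separate network
    ring1_addr: 198.51.100.14
  }
  # ... repeat for the other four nodes, each with its own ring1_addr
}
```

The admin guide linked above describes the safe way to edit the file (work on a copy, then move it into place), which is worth following rather than editing the live file directly.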
 
Hi Moayad,

Thanks for your reply. I will change the configuration to use a dedicated interface for Corosync. The connection loss might be due to some network configuration changes I was making at the time, which resulted in a short interruption.

That shouldn't trigger a reboot though, if I'm correct?

Best regards
 
Hi,

Corosync does not need a lot of bandwidth, but it does need low latency. Keeping the other Corosync links is a good fallback should the dedicated Corosync link have an issue.
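
If both links are kept, link priorities can express which one should be preferred. The following is only a sketch and assumes that link 1 is the new dedicated network and link 0 the existing shared one; the priority values are arbitrary examples. In the default passive mode, the link with the higher `knet_link_priority` value is used first, and the other one takes over if it fails.

```
# Illustrative fragment only; assumes link 1 is the dedicated Corosync network.
totem {
  # ... existing settings (cluster_name, config_version, ...) stay as they are
  interface {
    linknumber: 0
    # shared vmbr0 link, kept as fallback
    knet_link_priority: 10
  }
  interface {
    linknumber: 1
    # dedicated link, preferred because of the higher value
    knet_link_priority: 20
  }
}
```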
 
