Random Reboot on Proxmox VE 8.1

Jan 23, 2023
Hi,

I've observed a seemingly random reboot of one of our main Proxmox nodes. I couldn't find a reason for the reboot in the logs.

The node was set up 4 days ago with a clean install and previously ran Proxmox VE 7.4 without any issues, with an uptime of multiple months between reboots for updates. I have attached the syslog output and a pvereport, I'd be happy if someone could take a look.

Best regards
 


Hello,

Thank you for the information!

From the syslog you've provided, I can see that Corosync lost its connection:
Code:
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: host: 5 link: 0 is down
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 has no active links
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 14 10:58:09 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] rx: host: 3 link: 0 is up
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 14 10:58:10 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] rx: host: 1 link: 0 is up
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 14 10:58:12 pve-main2 corosync[4231]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 14 10:58:14 pve-main2 pvescheduler[512253]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 14 10:58:14 pve-main2 pvescheduler[512254]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[1]: 4
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (4.4467) was formed. Members
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[3]: 2 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync joined[2]: 2 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (2.446b) was formed. Members joined: 2 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync members[5]: 1 2 3 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Sync joined[4]: 1 2 3 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] A new membership (1.446f) was formed. Members joined: 1 2 3 left: 2
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] Failed to receive the leave message. failed: 2
Mar 14 10:58:21 pve-main2 corosync[4231]:   [TOTEM ] Retransmit List: 1
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: members: 1/3226, 2/1614, 3/1492, 4/4235, 5/1237
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: starting data syncronisation
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] This node is within the primary component and will provide service.
Mar 14 10:58:21 pve-main2 corosync[4231]:   [QUORUM] Members[5]: 1 2 3 4 5
Mar 14 10:58:21 pve-main2 corosync[4231]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: node has quorum
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: members: 1/3226, 2/1614, 3/1492, 4/4235, 5/1237
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: starting data syncronisation
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000051)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000052)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: received sync request (epoch 1/3226/0000003A)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [dcdb] notice: received sync request (epoch 1/3226/00000053)
Mar 14 10:58:21 pve-main2 pmxcfs[4235]: [status] notice: received sync request (epoch 1/3226/0000003B)

Looking at the `corosync.conf` and the network configuration, I see you have only a single vmbr0 shared by the VMs and Corosync. We recommend a dedicated NIC for the Corosync traffic, plus a second link for redundancy. In this case, I would add a ring_1 to the Corosync config, as described in our documentation [0].

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
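For illustration, a second ring is added by giving each node a `ring1_addr` entry in `/etc/pve/corosync.conf` (the addresses below are placeholders, assuming a separate 10.10.10.0/24 network is used for the new link); remember to also increment `config_version` in the `totem` section so the change propagates:
Code:
nodelist {
  node {
    name: pve-main2
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.0.2.14   # existing link (placeholder address)
    ring1_addr: 10.10.10.14  # new dedicated Corosync link (placeholder address)
  }
  # ... every other node gets a ring1_addr on the same dedicated network
}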
 
Hi Moayad,

Thanks for your reply. I will change the configuration to use a dedicated interface for Corosync. The connection loss might have been caused by some network configuration changes I was making, which resulted in a short interruption.

That shouldn't trigger a reboot though, should it?

Best regards
 
Hi,

Corosync does not need much bandwidth, but it does need low latency. Keeping the other link as a fallback is still good, in case the dedicated Corosync link has an issue.

To answer your question: yes, it can. If HA is active on the node, losing quorum for longer than about a minute causes the watchdog to fence the node, which results in a hard reboot. Without active HA resources, a short Corosync outage alone would not reboot the node.
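As a quick sanity check after adding the second ring, the state of both links can be inspected on each node (a sketch; the exact output depends on your cluster, and the ping target below is a placeholder address on the new Corosync network):
Code:
corosync-cfgtool -s     # show local link status for each ring
corosync-cfgtool -n     # show knet link state and MTU towards each neighbour
ping -c 10 10.10.10.11  # spot-check latency on the dedicated network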