Node on Proxmox Cluster Reboots

Dec 14, 2020
Hello Support,

We have a cluster of two Proxmox nodes (pve and pve2).
The pve node just rebooted on its own (time of reboot: 12:43 pm GMT).
What could be causing these reboots?
This keeps happening from time to time.
I have attached the syslog file from pve for investigation.

Regards,
Edmund
support@delaphonegh.com
 

Attachment: syslog.txt (1,008.5 KB)
Do you have HA resources defined on that node?
Code:
Mar  4 12:31:04 pve corosync[7369]:   [KNET  ] link: host: 1 link: 0 is down
Mar  4 12:31:04 pve corosync[7369]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar  4 12:31:04 pve corosync[7369]:   [KNET  ] host: host: 1 has no active links
Mar  4 12:31:05 pve corosync[7369]:   [TOTEM ] Token has not been received in 750 ms
Mar  4 12:31:05 pve corosync[7369]:   [TOTEM ] A processor failed, forming new configuration.
Mar  4 12:31:05 pve kernel: [1066387.177412] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Down
Mar  4 12:31:05 pve kernel: [1066387.180055] vmbr1: port 1(eno3) entered disabled state
Mar  4 12:31:06 pve corosync[7369]:   [TOTEM ] A new membership (2:524) was formed. Members left: 1
Mar  4 12:31:06 pve corosync[7369]:   [TOTEM ] Failed to receive the leave message. failed: 1
Mar  4 12:31:06 pve corosync[7369]:   [CPG   ] downlist left_list: 1 received
Mar  4 12:31:06 pve pmxcfs[7118]: [dcdb] notice: members: 2/7118
Mar  4 12:31:06 pve pmxcfs[7118]: [status] notice: members: 2/7118
Mar  4 12:31:06 pve pmxcfs[7118]: [status] notice: node lost quorum
Mar  4 12:31:06 pve corosync[7369]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar  4 12:31:06 pve corosync[7369]:   [QUORUM] Members[1]: 2
Mar  4 12:31:06 pve corosync[7369]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar  4 12:31:07 pve pve-ha-crm[7927]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Mar  4 12:31:08 pve pve-ha-lrm[7935]: lost lock 'ha_agent_pve_lock - cfs lock update failed - Permission denied
Mar  4 12:31:12 pve pve-ha-crm[7927]: status change master => lost_manager_lock
Mar  4 12:31:12 pve pve-ha-crm[7927]: watchdog closed (disabled)
Mar  4 12:31:12 pve pve-ha-crm[7927]: status change lost_manager_lock => wait_for_quorum
Mar  4 12:31:13 pve pve-ha-lrm[7935]: status change active => lost_agent_lock
Mar  4 12:31:27 pve kernel: [1066409.101251] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar  4 12:31:27 pve kernel: [1066409.103119] vmbr1: port 1(eno3) entered blocking state
Mar  4 12:31:27 pve kernel: [1066409.104691] vmbr1: port 1(eno3) entered forwarding state
Mar  4 12:31:59 pve watchdog-mux[6353]: client watchdog expired - disable watchdog updates
It looks like your NIC (eno3 in the log) went down, so there was no connection to the other node. The node therefore lost quorum and could no longer update the watchdog, which is enabled once an HA resource is defined on that node.
Once the watchdog expires, the node is fenced, which explains the sudden reboot.
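
If it helps to check other reboots against the same pattern, here is a rough Python sketch that pulls the key events out of a syslog and prints them as a timeline. The event labels, the `timeline` function name and the placeholder year are my own assumptions for illustration; the match strings are taken from the log lines quoted above.

```python
import re
from datetime import datetime

# Match strings copied from the excerpt above; the labels are illustrative
# names chosen for this sketch, not Proxmox terminology.
EVENTS = [
    ("nic_down", re.compile(r"NIC Link is Down")),
    ("quorum_lost", re.compile(r"node lost quorum")),
    ("nic_up", re.compile(r"NIC Link is Up")),
    ("watchdog_expired", re.compile(r"client watchdog expired")),
]


def timeline(lines, year=2000):
    """Return (timestamp, label) pairs for lines matching one of EVENTS.

    Classic syslog timestamps carry no year, so `year` is only a
    placeholder (assumption) needed to build a datetime object.
    """
    found = []
    for line in lines:
        for label, pattern in EVENTS:
            if pattern.search(line):
                # The first 15 characters hold the timestamp, e.g. "Mar  4 12:31:05".
                stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
                found.append((stamp, label))
                break
    return found


# Four lines taken from the excerpt above.
sample = [
    "Mar  4 12:31:05 pve kernel: [1066387.177412] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Down",
    "Mar  4 12:31:06 pve pmxcfs[7118]: [status] notice: node lost quorum",
    "Mar  4 12:31:27 pve kernel: [1066409.101251] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX",
    "Mar  4 12:31:59 pve watchdog-mux[6353]: client watchdog expired - disable watchdog updates",
]

events = timeline(sample)
start = events[0][0]
for stamp, label in events:
    # Show each event with its offset in seconds from the first one.
    print(f"{stamp:%H:%M:%S}  +{(stamp - start).seconds:>3}s  {label}")
```

To run it on a full log, pass the open file instead of `sample`, for example `timeline(open("syslog.txt"))`. On the excerpt above, the watchdog expiry is logged 53 seconds after the quorum loss.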

Also, update to the latest version! You are running a very old version of PVE 6, and there have been several bug fixes since then, especially for corosync and libknet1.
 
