Node on Proxmox Cluster Reboots

Edmund Fiadzo · Mar 4, 2021

Hello Support,

We have a cluster of two Proxmox nodes (pve and pve2).
The pve node just rebooted on its own (time of reboot: 12:43 pm GMT).
What could be causing these reboots?
This keeps happening from time to time.
I have attached the syslog file from pve for investigation.

Regards,
Edmund
support@delaphonegh.com

mira · Mar 5, 2021

Do you have HA resources defined on that node?

Code:

Mar  4 12:31:04 pve corosync[7369]:   [KNET  ] link: host: 1 link: 0 is down
Mar  4 12:31:04 pve corosync[7369]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar  4 12:31:04 pve corosync[7369]:   [KNET  ] host: host: 1 has no active links
Mar  4 12:31:05 pve corosync[7369]:   [TOTEM ] Token has not been received in 750 ms
Mar  4 12:31:05 pve corosync[7369]:   [TOTEM ] A processor failed, forming new configuration.
Mar  4 12:31:05 pve kernel: [1066387.177412] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Down
Mar  4 12:31:05 pve kernel: [1066387.180055] vmbr1: port 1(eno3) entered disabled state
Mar  4 12:31:06 pve corosync[7369]:   [TOTEM ] A new membership (2:524) was formed. Members left: 1
Mar  4 12:31:06 pve corosync[7369]:   [TOTEM ] Failed to receive the leave message. failed: 1
Mar  4 12:31:06 pve corosync[7369]:   [CPG   ] downlist left_list: 1 received
Mar  4 12:31:06 pve pmxcfs[7118]: [dcdb] notice: members: 2/7118
Mar  4 12:31:06 pve pmxcfs[7118]: [status] notice: members: 2/7118
Mar  4 12:31:06 pve pmxcfs[7118]: [status] notice: node lost quorum
Mar  4 12:31:06 pve corosync[7369]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar  4 12:31:06 pve corosync[7369]:   [QUORUM] Members[1]: 2
Mar  4 12:31:06 pve corosync[7369]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar  4 12:31:07 pve pve-ha-crm[7927]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Mar  4 12:31:08 pve pve-ha-lrm[7935]: lost lock 'ha_agent_pve_lock - cfs lock update failed - Permission denied
Mar  4 12:31:12 pve pve-ha-crm[7927]: status change master => lost_manager_lock
Mar  4 12:31:12 pve pve-ha-crm[7927]: watchdog closed (disabled)
Mar  4 12:31:12 pve pve-ha-crm[7927]: status change lost_manager_lock => wait_for_quorum
Mar  4 12:31:13 pve pve-ha-lrm[7935]: status change active => lost_agent_lock
Mar  4 12:31:27 pve kernel: [1066409.101251] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar  4 12:31:27 pve kernel: [1066409.103119] vmbr1: port 1(eno3) entered blocking state
Mar  4 12:31:27 pve kernel: [1066409.104691] vmbr1: port 1(eno3) entered forwarding state
Mar  4 12:31:59 pve watchdog-mux[6353]: client watchdog expired - disable watchdog updates

It looks like your NIC went down, so there was no connection to the other node. As a result the node lost quorum and could no longer update the watchdog, which is enabled once an HA resource is defined on that node.
Once the watchdog expires, the node is fenced, which explains the sudden reboot.
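
To make that sequence easier to see, here is a minimal Python sketch that pulls the key events out of syslog lines and prints how long after the first one each occurred. The marker strings are copied from the excerpt above; the labels, the function names and the year passed to the parser are assumptions made for illustration, not part of any Proxmox tooling.

```python
from datetime import datetime

# Substrings copied from the log excerpt above; the labels are illustrative.
MARKERS = [
    ("has no active links", "corosync link down"),
    ("node lost quorum", "quorum lost"),
    ("lost lock 'ha_agent", "LRM lost its lock"),
    ("client watchdog expired", "watchdog expired"),
]


def parse_time(line, year=2021):
    """Parse the leading syslog timestamp (syslog omits the year, so it is assumed)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def timeline(lines, year=2021):
    """Return (timestamp, label) for every line that contains a known marker."""
    events = []
    for line in lines:
        for needle, label in MARKERS:
            if needle in line:
                events.append((parse_time(line, year), label))
                break
    return events


# Four lines taken verbatim from the excerpt above.
LOG = """\
Mar  4 12:31:04 pve corosync[7369]:   [KNET  ] host: host: 1 has no active links
Mar  4 12:31:06 pve pmxcfs[7118]: [status] notice: node lost quorum
Mar  4 12:31:08 pve pve-ha-lrm[7935]: lost lock 'ha_agent_pve_lock - cfs lock update failed - Permission denied
Mar  4 12:31:59 pve watchdog-mux[6353]: client watchdog expired - disable watchdog updates
""".splitlines()

events = timeline(LOG)
start = events[0][0]
for when, label in events:
    offset = int((when - start).total_seconds())
    print(f"{when:%H:%M:%S}  +{offset:>3}s  {label}")
```

In this excerpt the watchdog expiry is logged 53 seconds after the quorum loss. To check other incidents, the same function can be fed the lines of a full log file instead of `LOG` (on a default install that is probably `/var/log/syslog`, but the path is an assumption).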

Also, update to the latest version! You're running a very old version of PVE 6, and there have been several bug fixes since then, especially for corosync and libknet1.

Edmund Fiadzo · Mar 5, 2021

Edmund Fiadzo · Mar 9, 2021

Thanks for the feedback.
