Proxmox hosts rebooting when losing NIC link

hac3ru

Member
Mar 6, 2021
Hello,

This is a bit weird... The infrastructure looks like this:
3 x HP servers (CS01/CS02/CS03) connected via:
  • 2 x 100 Gbps Mikrotik switches in an active-passive configuration
  • 2 x 10 Gbps Mikrotik switches in an active-passive configuration
Today, during regular maintenance, all three nodes rebooted. That was odd, so I first suspected a power loss, but the IPMI logs revealed nothing. The reboot happened the moment I disconnected one of the 100 Gbps interfaces on the CS01 server. I dismissed it as a one-off not worth pursuing unless it happened again.
Thirty minutes later, plugging an interface into a different server (CS02) rebooted two machines: CS01 and CS02. Since then I have kept connecting and disconnecting interfaces to see what is going on. The weirdest part: while CS02 had NO NICs connected at all, I unplugged CS01 and BOTH CS01 and CS02 rebooted. That convinced me it is not a switch or cable issue, but rather a software issue.
Going through the logs, I found this in the journal:
Code:
Jun 29 22:13:36 ZRH-GLT-CS02 iscsid[1670]: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]:   [QUORUM] Sync members[1]: 2
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]:   [QUORUM] Sync left[1]: 3
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]:   [TOTEM ] A new membership (2.2dc) was formed. Members left: 3
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]:   [TOTEM ] Failed to receive the leave message. failed: 3
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: members: 2/1798
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [status] notice: members: 2/1798
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]:   [QUORUM] Members[1]: 2
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [status] notice: node lost quorum
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: received write while not quorate - trigger resync
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: leaving CPG group
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: start cluster connection
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: cpg_join failed: 14
Jun 29 22:13:36 ZRH-GLT-CS02 pve-ha-lrm[1950]: lost lock 'ha_agent_ZRH-GLT-CS02_lock - cfs lock update failed - Device or resource busy
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: can't initialize service
Jun 29 22:13:36 ZRH-GLT-CS02 pve-ha-crm[1938]: status change slave => wait_for_quorum
Jun 29 22:13:41 ZRH-GLT-CS02 pve-ha-lrm[1950]: status change active => lost_agent_lock
Jun 29 22:13:42 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: members: 2/1798
Jun 29 22:13:42 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: all data is up to date
Jun 29 22:13:58 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:05 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:09 ZRH-GLT-CS02 pvescheduler[3538]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 29 22:14:09 ZRH-GLT-CS02 pvescheduler[3537]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 29 22:14:12 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:19 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:26 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:27 ZRH-GLT-CS02 watchdog-mux[1422]: client watchdog expired - disable watchdog updates
The timestamps match the reboot almost to the second: the machine rebooted at 22:14:29. Any ideas what is going on, and how I can debug this?

Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1np0
iface eno1np0 inet manual

auto eno2np1
iface eno2np1 inet manual

auto enp65s0f0np0
iface enp65s0f0np0 inet manual
        mtu 9000

auto enp65s0f1np1
iface enp65s0f1np1 inet manual
        mtu 9000

auto enp1s0f0np0
iface enp1s0f0np0 inet manual

auto enp1s0f1np1
iface enp1s0f1np1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0np0 enp65s0f1np1
        bond-miimon 100
        bond-mode active-backup
        bond-primary enp65s0f0np0
        mtu 9000
#Internal bond

auto bond1
iface bond1 inet manual
        bond-slaves enp1s0f0np0 enp1s0f1np1
        bond-miimon 100
        bond-mode active-backup
        bond-primary enp1s0f0np0
#Internet bond

auto vmbr0
iface vmbr0 inet static
        address 10.41.199.32/24
        gateway 10.41.199.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000
#Internal network

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#Internet access

auto tm
iface tm inet static
        address 10.41.254.37/24
        vlan-id 254
        vlan-raw-device vmbr1
#Temp Management - Backup for migration

Code:
# pveversion
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-3-pve)

As a side note, is there a way to edit the bridge-vids via the GUI, changing 2-4094 to 2-512, to avoid these warnings?
Code:
mlx5_core 0000:01:00.1: mlx5e_vport_context_update_vlans:179:(pid 1600): netdev vlans list size (4074) > (512) max vport list size, some vlans will be dropped
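For what it's worth, I am not aware of a GUI field that exposes bridge-vids directly, but narrowing the range by hand in /etc/network/interfaces should silence the warning. A sketch, reusing the vmbr1 definition from above with only the VID range changed (adjust 2-512 to cover the VLANs you actually use):
Code:
auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-512
#Internet access
After editing, the change can be applied without a reboot via ifreload -a (ifupdown2, which Proxmox VE uses by default).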

Any help with this will be greatly appreciated.
 
Hi hac3ru,

Could you please share how you managed to fix your issue? I'm experiencing the exact same problem on a new Proxmox installation and my setup is very similar to yours.

Thanks,

Chris
 
Hello @cbourque,

I fixed the issue by adding a second link for the corosync cluster. That way, when the ring0 link fails, the cluster can still communicate over ring1 (different fiber, different switch, etc.), so the node knows it is not isolated. The downside is that if ring0 (the traffic network) fails, the VMs are not migrated to or started on other nodes, since the nodes can still communicate with each other.
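For anyone finding this thread later, here is a sketch of what a two-link setup looks like in /etc/pve/corosync.conf. The node names and addresses below are examples only (one node shown; the others follow the same pattern), and config_version must be incremented whenever you edit the file:
Code:
nodelist {
  node {
    name: ZRH-GLT-CS01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.41.199.31
    ring1_addr: 10.41.254.31
  }
  # ...repeat for CS02 and CS03 with their own addresses...
}

totem {
  cluster_name: example-cluster
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4
  version: 2
}
With two knet links, corosync keeps the membership alive over ring1 when the ring0 NIC loses link, so the HA watchdog is not triggered.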
 
Hello @hac3ru,

Ok, I see. The odd thing in my case is that I'm not even using HA (it's a standalone server), yet somehow I get the same behaviour...

Anyway, thanks for replying I really appreciate it and I'll post back when I find something.

Thanks,

C.
 
