Hello,
This is a bit weird... The infrastructure looks like this:
3 x HP servers (CS01/CS02/CS03), connected via:
- 2 x 100 Gbps Mikrotik switches in an active-passive config
- 2 x 10 Gbps Mikrotik switches in an active-passive config
30 minutes later, when I plugged in an interface on a different server (CS02), two machines rebooted: CS01 and CS02. Since then I have kept connecting and disconnecting interfaces to see what is going on. The weirdest part: while CS02 had NO NICs connected at all, I unplugged CS01 and BOTH CS01 and CS02 rebooted. This made me think it is not a switch or cable issue, but rather a software issue.
Going through the logs, I found this in the journal:
Code:
Jun 29 22:13:36 ZRH-GLT-CS02 iscsid[1670]: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]: [QUORUM] Sync members[1]: 2
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]: [QUORUM] Sync left[1]: 3
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]: [TOTEM ] A new membership (2.2dc) was formed. Members left: 3
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]: [TOTEM ] Failed to receive the leave message. failed: 3
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: members: 2/1798
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [status] notice: members: 2/1798
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]: [QUORUM] Members[1]: 2
Jun 29 22:13:36 ZRH-GLT-CS02 corosync[1883]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [status] notice: node lost quorum
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: received write while not quorate - trigger resync
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: leaving CPG group
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: start cluster connection
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: cpg_join failed: 14
Jun 29 22:13:36 ZRH-GLT-CS02 pve-ha-lrm[1950]: lost lock 'ha_agent_ZRH-GLT-CS02_lock - cfs lock update failed - Device or resource busy
Jun 29 22:13:36 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] crit: can't initialize service
Jun 29 22:13:36 ZRH-GLT-CS02 pve-ha-crm[1938]: status change slave => wait_for_quorum
Jun 29 22:13:41 ZRH-GLT-CS02 pve-ha-lrm[1950]: status change active => lost_agent_lock
Jun 29 22:13:42 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: members: 2/1798
Jun 29 22:13:42 ZRH-GLT-CS02 pmxcfs[1798]: [dcdb] notice: all data is up to date
Jun 29 22:13:58 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:05 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:09 ZRH-GLT-CS02 pvescheduler[3538]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 29 22:14:09 ZRH-GLT-CS02 pvescheduler[3537]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 29 22:14:12 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:19 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:26 ZRH-GLT-CS02 iscsid[1670]: connect to 10.41.199.21:3260 failed (No route to host)
Jun 29 22:14:27 ZRH-GLT-CS02 watchdog-mux[1422]: client watchdog expired - disable watchdog updates

The timestamp matches the reboot almost to the second. The machine rebooted at 22:14:29. Any ideas what's going on, how can I debug this, etc.?
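In case it is useful, this is what I plan to run on each node the next time I reproduce the unplug; I am assuming the standard PVE / systemd tools are the right place to look, so treat it as a sketch rather than anything definitive:
Code:
# quorum, membership and corosync link status
pvecm status
corosync-cfgtool -s

# HA manager view (master node, LRM state per node)
ha-manager status

# cluster / HA / watchdog services from the previous boot, around the time of the fence
journalctl -b -1 -u corosync -u pve-cluster -u pve-ha-lrm -u watchdog-mux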
This is the network configuration (/etc/network/interfaces):
Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!
auto lo
iface lo inet loopback

auto eno1np0
iface eno1np0 inet manual

auto eno2np1
iface eno2np1 inet manual

auto enp65s0f0np0
iface enp65s0f0np0 inet manual
	mtu 9000

auto enp65s0f1np1
iface enp65s0f1np1 inet manual
	mtu 9000

auto enp1s0f0np0
iface enp1s0f0np0 inet manual

auto enp1s0f1np1
iface enp1s0f1np1 inet manual

auto bond0
iface bond0 inet manual
	bond-slaves enp65s0f0np0 enp65s0f1np1
	bond-miimon 100
	bond-mode active-backup
	bond-primary enp65s0f0np0
	mtu 9000
#Internal bond

auto bond1
iface bond1 inet manual
	bond-slaves enp1s0f0np0 enp1s0f1np1
	bond-miimon 100
	bond-mode active-backup
	bond-primary enp1s0f0np0
#Internet bond

auto vmbr0
iface vmbr0 inet static
	address 10.41.199.32/24
	gateway 10.41.199.1
	bridge-ports bond0
	bridge-stp off
	bridge-fd 0
	mtu 9000
#Internal network

auto vmbr1
iface vmbr1 inet manual
	bridge-ports bond1
	bridge-stp off
	bridge-fd 0
	bridge-vlan-aware yes
	bridge-vids 2-4094
#Internet access

auto tm
iface tm inet static
	address 10.41.254.37/24
	vlan-id 254
	vlan-raw-device vmbr1
#Temp Management - Backup for migration
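Since both bonds are active-backup, I assume the failover state of each bond can be watched directly while cables are pulled; the bonding driver exposes it here (bond0 used only as the example, same idea for bond1):
Code:
# which slave is currently active, and the link state of each slave
cat /proc/net/bonding/bond0

# quick link overview of all NICs, bonds and bridges
ip -br link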
For reference, the PVE version:
Code:
# pveversion
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-3-pve)
As a side-note, is there a way to edit, via the GUI, the bridge-vids from 2-4094 to 2-512, to avoid the following warnings?
Code:
mlx5_core 0000:01:00.1: mlx5e_vport_context_update_vlans:179:(pid 1600): netdev vlans list size (4074) > (512) max vport list size, some vlans will be dropped
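If the GUI cannot do it, I assume the fallback is editing the vmbr1 stanza by hand and reloading; a minimal sketch of what I have in mind (2-512 is just the range that would cover my VLANs, nothing mandated):
Code:
auto vmbr1
iface vmbr1 inet manual
	bridge-ports bond1
	bridge-stp off
	bridge-fd 0
	bridge-vlan-aware yes
	bridge-vids 2-512
#Internet access
and then apply it with ifupdown2:
Code:
ifreload -a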
Any help with this will be greatly appreciated.