Node Drops from Cluster When PBS Kicks In

manofoz · Wednesday at 13:48

Hello,

I've seen this happen a few times now and it's always right when PBS goes to do backups on the node. The node keeps logging but drops from the cluster and shows offline. Here are the logs:

]Sep 11 01:00:00 pve04 pvescheduler[562491]: <root@pam> starting task UPIDve04:0008953C:0063106A:66E123D0:vzdump::root@pam:
Sep 11 01:00:00 pve04 pvescheduler[562492]: INFO: starting new backup job: vzdump --all 1 --prune-backups 'keep-last=3' --fleecing 0 --mode snapshot --quiet 1 --no>
Sep 11 01:00:00 pve04 pvescheduler[562492]: INFO: Starting Backup of VM 112 (qemu)
Sep 11 01:00:02 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:02 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:02 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:03 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:03 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:04 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:05 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:02:46 pve04 corosync[1282]: [KNET ] link: host: 7 link: 0 is down
Sep 11 01:02:46 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:02:46 pve04 corosync[1282]: [KNET ] host: host: 7 has no active links
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] rx: host: 7 link: 0 is up
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 11 01:05:34 pve04 pvescheduler[562492]: INFO: Finished Backup of VM 112 (00:05:34)
Sep 11 01:05:35 pve04 pvescheduler[562492]: INFO: Starting Backup of VM 117 (qemu)
Sep 11 01:05:35 pve04 systemd[1]: Started 117.scope.
Sep 11 01:05:35 pve04 kernel: rbd: loaded (major 251)
Sep 11 01:05:35 pve04 kernel: libceph: mon1 (1)192.168.40.7:6789 session established
Sep 11 01:05:35 pve04 kernel: libceph: client78911981 fsid 9b0628e1-1fe9-49d2-b65b-746d05215e3d
Sep 11 01:05:36 pve04 kernel: rbd: rbd0: capacity 4194304 features 0x3d
Sep 11 01:05:36 pve04 kernel: tap117i0: entered promiscuous mode
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered blocking state
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered disabled state
Sep 11 01:05:36 pve04 kernel: fwpr117p0: entered allmulticast mode
Sep 11 01:05:36 pve04 kernel: fwpr117p0: entered promiscuous mode
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered blocking state
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered forwarding state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered disabled state
Sep 11 01:05:36 pve04 kernel: fwln117i0: entered allmulticast mode
Sep 11 01:05:36 pve04 kernel: fwln117i0: entered promiscuous mode
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered forwarding state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered disabled state
Sep 11 01:05:36 pve04 kernel: tap117i0: entered allmulticast mode
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered forwarding state
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] link: host: 7 link: 0 is down
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] link: host: 3 link: 0 is down
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] link: host: 2 link: 0 is down
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 7 has no active links
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 3 has no active links
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 2 has no active links
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] rx: host: 3 link: 0 is up
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] rx: host: 2 link: 0 is up
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] rx: host: 7 link: 0 is up
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 11 01:06:54 pve04 kernel: tap117i0: left allmulticast mode
Sep 11 01:06:54 pve04 kernel: fwbr117i0: port 2(tap117i0) entered disabled state
Sep 11 01:06:54 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered disabled state
Sep 11 01:06:54 pve04 kernel: vmbr0: port 4(fwpr117p0) entered disabled state
Sep 11 01:06:54 pve04 kernel: fwln117i0 (unregistering): left allmulticast mode
Sep 11 01:06:54 pve04 kernel: fwln117i0 (unregistering): left promiscuous mode
Sep 11 01:06:54 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered disabled state
Sep 11 01:06:54 pve04 kernel: fwpr117p0 (unregistering): left allmulticast mode
Sep 11 01:06:54 pve04 kernel: fwpr117p0 (unregistering): left promiscuous mode
Sep 11 01:06:54 pve04 kernel: vmbr0: port 4(fwpr117p0) entered disabled state
Sep 11 01:06:54 pve04 qmeventd[888]: read: Connection reset by peer
Sep 11 01:06:54 pve04 systemd[1]: 117.scope: Deactivated successfully.
Sep 11 01:06:54 pve04 systemd[1]: 117.scope: Consumed 3min 1.147s CPU time.
Sep 11 01:06:55 pve04 qmeventd[566503]: Starting cleanup for 117
Sep 11 01:06:55 pve04 qmeventd[566503]: trying to acquire lock...
Sep 11 01:06:55 pve04 qmeventd[566503]: OK
Sep 11 01:06:55 pve04 qmeventd[566503]: Finished cleanup for 117
Sep 11 01:06:56 pve04 pmxcfs[1167]: [dcdb] notice: data verification successful
Sep 11 01:06:56 pve04 pvescheduler[562492]: INFO: Finished Backup of VM 117 (00:01:21)
Sep 11 01:06:56 pve04 pvescheduler[562492]: INFO: Starting Backup of VM 122 (qemu)
Sep 11 01:06:56 pve04 kernel: i40e 0000:03:00.0: i40e_ptp_stop: removed PHC on enp3s0f0np0
Sep 11 01:06:57 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered disabled state
Sep 11 01:06:57 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered disabled state
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.0 enp3s0f0np0 (unregistering): left allmulticast mode
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.0 enp3s0f0np0 (unregistering): left promiscuous mode
Sep 11 01:06:57 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered disabled state
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.1: i40e_ptp_stop: removed PHC on enp3s0f1np1
Sep 11 01:06:57 pve04 kernel: vmbr1: port 1(enp3s0f1np1) entered disabled state
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.1 enp3s0f1np1 (unregistering): left allmulticast mode
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.1 enp3s0f1np1 (unregistering): left promiscuous mode
Sep 11 01:06:57 pve04 kernel: vmbr1: port 1(enp3s0f1np1) entered disabled state
Sep 11 01:06:57 pve04 systemd[1]: Started 122.scope.
Sep 11 01:06:58 pve04 kernel: tap122i0: entered promiscuous mode
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered blocking state
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered disabled state
Sep 11 01:06:58 pve04 kernel: fwpr122p0: entered allmulticast mode
Sep 11 01:06:58 pve04 kernel: fwpr122p0: entered promiscuous mode
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered blocking state
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered forwarding state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered disabled state
Sep 11 01:06:58 pve04 kernel: fwln122i0: entered allmulticast mode
Sep 11 01:06:58 pve04 kernel: fwln122i0: entered promiscuous mode
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered forwarding state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered disabled state
Sep 11 01:06:58 pve04 kernel: tap122i0: entered allmulticast mode
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered forwarding state
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 7 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 3 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 2 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 1 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 5 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 4 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 8 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 7 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 3 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 2 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 1 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 5 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 4 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 8 has no active links
Sep 11 01:07:01 pve04 kernel: vfio-pci 0000:03:00.0: Masking broken INTx support
Sep 11 01:07:01 pve04 kernel: vfio-pci 0000:03:00.1: Masking broken INTx support
Sep 11 01:07:02 pve04 corosync[1282]: [TOTEM ] Token has not been received in 5175 ms
Sep 11 01:07:02 pve04 nut-monitor[1076]: Poll UPS [APC-900W-01@nut01.example.com] failed - Server disconnected
Sep 11 01:07:02 pve04 nut-monitor[1076]: Communications with UPS APC-900W-01@nut01.example.com lost
Sep 11 01:07:02 pve04 upssched[566753]: Timer daemon started
Sep 11 01:07:02 pve04 upssched[566753]: New timer: commbad_timer (300 seconds)
Sep 11 01:07:02 pve04 nut-monitor[566748]: Network UPS Tools upsmon 2.8.0
Sep 11 01:07:03 pve04 corosync[1282]: [TOTEM ] A processor failed, forming new configuration: token timed out (6900ms), waiting 8280ms for consensus.
Sep 11 01:07:04 pve04 pvestatd[1496]: got timeout
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] Sync members[1]: 6
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] Sync left[7]: 1 2 3 4 5 7 8
Sep 11 01:07:12 pve04 corosync[1282]: [TOTEM ] A new membership (6.184a) was formed. Members left: 1 2 3 4 5 7 8
Sep 11 01:07:12 pve04 corosync[1282]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4 5 7 8
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] notice: members: 6/1167
Sep 11 01:07:12 pve04 pmxcfs[1167]: [status] notice: members: 6/1167
Sep 11 01:07:12 pve04 pmxcfs[1167]: [status] notice: node lost quorum
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] Members[1]: 6
Sep 11 01:07:12 pve04 corosync[1282]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: received write while not quorate - trigger resync
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: leaving CPG group
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] notice: start cluster connection
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: cpg_join failed: 14
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: can't initialize service
Sep 11 01:07:12 pve04 pve-ha-crm[1683]: status change slave => wait_for_quorum
Sep 11 01:07:12 pve04 pve-ha-lrm[2200]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve04/lrm_status.tmp.2200' - Permission denied
Sep 11 01:07:12 pve04 pvescheduler[566738]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Sep 11 01:07:12 pve04 pvescheduler[566737]: replication: cfs-lock 'file-replication_cfg' error: no quorum!

And then for the rest of the night it just logs these timeouts:

Sep 11 01:07:54 pve04 kernel: libceph: mon0 (1)192.168.40.6:6789 socket closed (con state V1_BANNER)
Sep 11 01:07:54 pve04 kernel: libceph: mon0 (1)192.168.40.6:6789 socket closed (con state V1_BANNER)
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.7:6802 osd.2 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.7:6808 osd.3 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.8:6806 osd.4 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.8:6802 osd.5 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.10:6806 osd.8 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.10:6802 osd.9 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.6:6808 osd.10 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.6:6804 osd.11 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.7788>
Sep 11 01:07:54 pve04 pvestatd[1496]: got timeout
Sep 11 01:07:54 pve04 pvestatd[1496]: unable to activate storage 'ISO-Templates' - directory '/mnt/pve/ISO-Templates' does not exist or is unreachable
Sep 11 01:07:54 pve04 ceph-osd[1595]: 2024-09-11T01:07:54.787-0400 766c5c8006c0 -1 osd.6 9102 heartbeat_check: no reply from 192.168.40.7:6802 osd.2 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1595]: 2024-09-11T01:07:54.787-0400 766c5c8006c0 -1 osd.6 9102 heartbeat_check: no reply from 192.168.40.7:6808 osd.3 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1595]: 2024-09-11T01:07:54.787-0400 766c5c8006c0 -1 osd.6 9102 heartbeat_check: no reply from 192.168.40.8:6806 osd.4 since back 202>

Not sure what logs to pull to diagnose this failure. Feels hardware related.

Thanks!

VictorSTS · Wednesday at 15:05

Looks like you are sharing the same nic/network for Corosync, Ceph and Backups and you are saturating it, making both corosync and Ceph to lose quorum. At the very least you should have a nic/network for Ceph and another one dedicated to corosync [1].

[1] https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy

manofoz · Wednesday at 22:52

VictorSTS said:
Looks like you are sharing the same nic/network for Corosync, Ceph and Backups and you are saturating it, making both corosync and Ceph to lose quorum. At the very least you should have a nic/network for Ceph and another one dedicated to corosync [1].

[1] https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy

Thanks! The good news is I have an unused NIC in these devices. I use my "main" SFP+ NIC for PVE and Ceph's public network but I do have a separate SFP+ for Ceph's private network, I did not realize I should also segregate the Ceph public network away from my main NIC which corosync and backups use.

Changing IPs for Corosync was a nightmare, I switched the subnet of my cluster, and I did it one device at a time and it was successful but challenging and time consuming. I also have a lot relying on Ceph like a k8s cluster using rook-ceph (external) and changing those IPs will be hard. The NIC I have is 2.5Gbe but I think that is fine for Ceph's public network since it's not crazy fast with my 10Gbe private network.

Do you think just doing the backups over the other NIC would be feasible?

manofoz · Thursday at 06:10

manofoz said:
Thanks! The good news is I have an unused NIC in these devices. I use my "main" SFP+ NIC for PVE and Ceph's public network but I do have a separate SFP+ for Ceph's private network, I did not realize I should also segregate the Ceph public network away from my main NIC which corosync and backups use.

Changing IPs for Corosync was a nightmare, I switched the subnet of my cluster, and I did it one device at a time and it was successful but challenging and time consuming. I also have a lot relying on Ceph like a k8s cluster using rook-ceph (external) and changing those IPs will be hard. The NIC I have is 2.5Gbe but I think that is fine for Ceph's public network since it's not crazy fast with my 10Gbe private network.

Do you think just doing the backups over the other NIC would be feasible?

I just realized you can throttle the bandwidth of pbs in the Backup settings. I'm going to try to set that to 2.5Gbe for the 10Gbe NIC and see if it helps. I can also stagger each VM's backups so not all 8 fire up at once though I've only ever seen this one node loose the cluster.

VictorSTS · Thursday at 08:53

Seems easy to simply add an IP/network to that spare NIC, connect your PBS to the new network and connect PVE to PBS using the new PBS IP. Backup traffic will flow through the spare NIC and reduce the chance of saturating the corosync link. It may still get saturated if Ceph generates lots of traffic: Ceph Cluster network is used for replication traffic only, every access to read or write from/to Ceph flows by the Public network.

Try to add a second corosync link and for your next cluster plan on using a vlan/network for every service, so it may be moved easily from one nic to another.

manofoz · Thursday at 13:59

Thanks! This is great information. I found https://pve.proxmox.com/wiki/Separate_Cluster_Networ which describes how to add a "redundant ring" which would be very easy to do. Since I don't have to lose quorum and orphan nodes like when changing IPs one by one.

Reducing the PBS bandwidth and speeding out the backups didn't help. The node lost the cluster just a bit later since I moved the time PBS ran. It also killed both SPF+ NICs but not the 2.5 Gbps one I use for my VPN VLAN so maybe there is hope for the second 2.5 Gbps as a redundant link.

Sep 12 03:14:50 pve04 kernel: i40e 0000:03:00.0 enp3s0f0np0: NIC Link is Down
Sep 12 03:14:50 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered disabled state
Sep 12 03:14:50 pve04 kernel: i40e 0000:03:00.1 enp3s0f1np1: NIC Link is Down
Sep 12 03:14:51 pve04 kernel: vmbr1: port 1(enp3s0f1np1) entered disabled state

The link comes back pretty quick too, but nothing recovers:

Sep 12 03:15:39 pve04 kernel: i40e 0000:03:00.0 enp3s0f0np0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
Sep 12 03:15:39 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered blocking state
Sep 12 03:15:39 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered forwarding state

Sep 12 03:15:39 pve04 kernel: i40e 0000:03:00.1 enp3s0f1np1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
Sep 12 03:15:39 pve04 kernel: vmbr1: port 1(enp3s0f1np1) entered blocking state
Sep 12 03:15:39 pve04 kernel: vmbr1: port 1(enp3s0f1np1) entered forwarding state

The weirdest part is I have five identical MS-01's in this cluster but only one is too weak to handle the backup storm.

VictorSTS · Thursday at 14:31

Something's off with those nics or maybe the switch/switch port(s)/gbics/cabling as they seem to lose physical link with your switch and obviously everything depending on them gets cut off. That shoulnd't happen.

Try to find out why that happens, try to overload them with ifperf/iperf3 and see if they keep dropping out. Didn't saw those events in your previous log.

manofoz · Friday at 03:54

VictorSTS said:
Something's off with those nics or maybe the switch/switch port(s)/gbics/cabling as they seem to lose physical link with your switch and obviously everything depending on them gets cut off. That shoulnd't happen.

Try to find out why that happens, try to overload them with ifperf/iperf3 and see if they keep dropping out. Didn't saw those events in your previous log.

It seems like plugging in a GPU to the PCIE 8 lane 16x slot of the MS-01 overloaded it or something. Once the network traffic picks up like when PBS it craps out but without the GPU it's fine. I put an A2000 in it which they advertise as 'compatible' but maybe not if you are going hard on both SFP+ NICs... I'll have to see if anyone with these things had and success, will throw it in another one of them next just to see if it's across the board or isolated to this host.

esi_y · Friday at 05:09

You don't happen to have RSTP on those, do you?

VictorSTS · Friday at 09:48

manofoz said:
It seems like plugging in a GPU to the PCIE 8 lane 16x slot of the MS-01 overloaded it or something. Once the network traffic picks up like when PBS it craps out but without the GPU it's fine. I put an A2000 in it which they advertise as 'compatible' but maybe not if you are going hard on both SFP+ NICs... I'll have to see if anyone with these things had and success, will throw it in another one of them next just to see if it's across the board or isolated to this host.

Just a shot to the sky: haven't played with MS-01 yet, but on some very small Supermicro's they use PCI-e bifurcation internally and you can set how many PCI-e lanes give to each slot. Maybe it's using just one lane for the NIC if the GPU is connected and it gets overloaded when moving some amount of traffic?

manofoz · Friday at 13:59

esi_y said:
You don't happen to have RSTP on those, do you?

My router has it, but I have not set anything up on linux for it.

manofoz · Friday at 14:02

VictorSTS said:
Just a shot to the sky: haven't played with MS-01 yet, but on some very small Supermicro's they use PCI-e bifurcation internally and you can set how many PCI-e lanes give to each slot. Maybe it's using just one lane for the NIC if the GPU is connected and it gets overloaded when moving some amount of traffic?

Very possible something like this is going on. ServeTheHome has a huge thread of "compatible hardware" which I read first before putting anything in it but maybe I put in too much for the small thing. I do have a RX 6400 in another one of the five which has been stable as they get though so this one could just be funky.

After I removed the A2000 it stopped falling off the network so that was defiantly doing it.

VictorSTS · Friday at 14:35

I would swap the A2000 and the RX6400 between nodes and try again. RX6400 has a PCI-e x4 bus while the A2000 has a x16 one, that might be it. It could even be related to the PSU not being able to provide enough power/being faulty. Hope you manage to pinpoint the issue!

manofoz · Friday at 14:55

VictorSTS said:
I would swap the A2000 and the RX6400 between nodes and try again. RX6400 has a PCI-e x4 bus while the A2000 has a x16 one, that might be it. It could even be related to the PSU not being able to provide enough power/being faulty. Hope you manage to pinpoint the issue!

Thanks for the help! I totally will. Good call on the x4 vs. x16 since the A2000 is running on an 8 lane x16 slot so it is already starved for lanes. The RX6400 wasn't a great purchase, I mistakenly thought I could use it for a hackintosh VM but couldn't and it doesn't do video encoding so I can't even hardware accelerate things or use it for Plex. Good for a lightweight gaming VM though.

The next experiment will have to wait as I am leaving for vacation tomorrow. The A2000 needs a custom heatsink to fit in this thing too but I held off on installing it because I wanted to test it first with a riser and the lid off. Can do that for another node though will just need to move some around.

Search

Search

Node Drops from Cluster When PBS Kicks In

manofoz

New Member

VictorSTS

Renowned Member

manofoz

New Member

manofoz

New Member

VictorSTS

Renowned Member

manofoz

New Member

VictorSTS

Renowned Member

manofoz

New Member

esi_y

Active Member

VictorSTS

Renowned Member

manofoz

New Member

manofoz

New Member

VictorSTS

Renowned Member

manofoz

New Member