Hello,
I've seen this happen a few times now, and it's always right when the scheduled backup to PBS starts on this node. The node keeps logging locally, but it drops out of the cluster and shows as offline. Here are the logs:
Sep 11 01:00:00 pve04 pvescheduler[562491]: <root@pam> starting task UPID:pve04:0008953C:0063106A:66E123D0:vzdump::root@pam:
Sep 11 01:00:00 pve04 pvescheduler[562492]: INFO: starting new backup job: vzdump --all 1 --prune-backups 'keep-last=3' --fleecing 0 --mode snapshot --quiet 1 --no>
Sep 11 01:00:00 pve04 pvescheduler[562492]: INFO: Starting Backup of VM 112 (qemu)
Sep 11 01:00:02 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:02 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:02 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:03 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:03 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:04 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:00:05 pve04 pmxcfs[1167]: [status] notice: received log
Sep 11 01:02:46 pve04 corosync[1282]: [KNET ] link: host: 7 link: 0 is down
Sep 11 01:02:46 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:02:46 pve04 corosync[1282]: [KNET ] host: host: 7 has no active links
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] rx: host: 7 link: 0 is up
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:02:49 pve04 corosync[1282]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 11 01:05:34 pve04 pvescheduler[562492]: INFO: Finished Backup of VM 112 (00:05:34)
Sep 11 01:05:35 pve04 pvescheduler[562492]: INFO: Starting Backup of VM 117 (qemu)
Sep 11 01:05:35 pve04 systemd[1]: Started 117.scope.
Sep 11 01:05:35 pve04 kernel: rbd: loaded (major 251)
Sep 11 01:05:35 pve04 kernel: libceph: mon1 (1)192.168.40.7:6789 session established
Sep 11 01:05:35 pve04 kernel: libceph: client78911981 fsid 9b0628e1-1fe9-49d2-b65b-746d05215e3d
Sep 11 01:05:36 pve04 kernel: rbd: rbd0: capacity 4194304 features 0x3d
Sep 11 01:05:36 pve04 kernel: tap117i0: entered promiscuous mode
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered blocking state
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered disabled state
Sep 11 01:05:36 pve04 kernel: fwpr117p0: entered allmulticast mode
Sep 11 01:05:36 pve04 kernel: fwpr117p0: entered promiscuous mode
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered blocking state
Sep 11 01:05:36 pve04 kernel: vmbr0: port 4(fwpr117p0) entered forwarding state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered disabled state
Sep 11 01:05:36 pve04 kernel: fwln117i0: entered allmulticast mode
Sep 11 01:05:36 pve04 kernel: fwln117i0: entered promiscuous mode
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered forwarding state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered disabled state
Sep 11 01:05:36 pve04 kernel: tap117i0: entered allmulticast mode
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered blocking state
Sep 11 01:05:36 pve04 kernel: fwbr117i0: port 2(tap117i0) entered forwarding state
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] link: host: 7 link: 0 is down
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] link: host: 3 link: 0 is down
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] link: host: 2 link: 0 is down
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 7 has no active links
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 3 has no active links
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 11 01:05:58 pve04 corosync[1282]: [KNET ] host: host: 2 has no active links
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] rx: host: 3 link: 0 is up
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] rx: host: 2 link: 0 is up
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] rx: host: 7 link: 0 is up
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:06:01 pve04 corosync[1282]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 11 01:06:54 pve04 kernel: tap117i0: left allmulticast mode
Sep 11 01:06:54 pve04 kernel: fwbr117i0: port 2(tap117i0) entered disabled state
Sep 11 01:06:54 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered disabled state
Sep 11 01:06:54 pve04 kernel: vmbr0: port 4(fwpr117p0) entered disabled state
Sep 11 01:06:54 pve04 kernel: fwln117i0 (unregistering): left allmulticast mode
Sep 11 01:06:54 pve04 kernel: fwln117i0 (unregistering): left promiscuous mode
Sep 11 01:06:54 pve04 kernel: fwbr117i0: port 1(fwln117i0) entered disabled state
Sep 11 01:06:54 pve04 kernel: fwpr117p0 (unregistering): left allmulticast mode
Sep 11 01:06:54 pve04 kernel: fwpr117p0 (unregistering): left promiscuous mode
Sep 11 01:06:54 pve04 kernel: vmbr0: port 4(fwpr117p0) entered disabled state
Sep 11 01:06:54 pve04 qmeventd[888]: read: Connection reset by peer
Sep 11 01:06:54 pve04 systemd[1]: 117.scope: Deactivated successfully.
Sep 11 01:06:54 pve04 systemd[1]: 117.scope: Consumed 3min 1.147s CPU time.
Sep 11 01:06:55 pve04 qmeventd[566503]: Starting cleanup for 117
Sep 11 01:06:55 pve04 qmeventd[566503]: trying to acquire lock...
Sep 11 01:06:55 pve04 qmeventd[566503]: OK
Sep 11 01:06:55 pve04 qmeventd[566503]: Finished cleanup for 117
Sep 11 01:06:56 pve04 pmxcfs[1167]: [dcdb] notice: data verification successful
Sep 11 01:06:56 pve04 pvescheduler[562492]: INFO: Finished Backup of VM 117 (00:01:21)
Sep 11 01:06:56 pve04 pvescheduler[562492]: INFO: Starting Backup of VM 122 (qemu)
Sep 11 01:06:56 pve04 kernel: i40e 0000:03:00.0: i40e_ptp_stop: removed PHC on enp3s0f0np0
Sep 11 01:06:57 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered disabled state
Sep 11 01:06:57 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered disabled state
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.0 enp3s0f0np0 (unregistering): left allmulticast mode
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.0 enp3s0f0np0 (unregistering): left promiscuous mode
Sep 11 01:06:57 pve04 kernel: vmbr0: port 1(enp3s0f0np0) entered disabled state
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.1: i40e_ptp_stop: removed PHC on enp3s0f1np1
Sep 11 01:06:57 pve04 kernel: vmbr1: port 1(enp3s0f1np1) entered disabled state
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.1 enp3s0f1np1 (unregistering): left allmulticast mode
Sep 11 01:06:57 pve04 kernel: i40e 0000:03:00.1 enp3s0f1np1 (unregistering): left promiscuous mode
Sep 11 01:06:57 pve04 kernel: vmbr1: port 1(enp3s0f1np1) entered disabled state
Sep 11 01:06:57 pve04 systemd[1]: Started 122.scope.
Sep 11 01:06:58 pve04 kernel: tap122i0: entered promiscuous mode
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered blocking state
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered disabled state
Sep 11 01:06:58 pve04 kernel: fwpr122p0: entered allmulticast mode
Sep 11 01:06:58 pve04 kernel: fwpr122p0: entered promiscuous mode
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered blocking state
Sep 11 01:06:58 pve04 kernel: vmbr0: port 1(fwpr122p0) entered forwarding state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered disabled state
Sep 11 01:06:58 pve04 kernel: fwln122i0: entered allmulticast mode
Sep 11 01:06:58 pve04 kernel: fwln122i0: entered promiscuous mode
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 1(fwln122i0) entered forwarding state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered disabled state
Sep 11 01:06:58 pve04 kernel: tap122i0: entered allmulticast mode
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered blocking state
Sep 11 01:06:58 pve04 kernel: fwbr122i0: port 2(tap122i0) entered forwarding state
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 7 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 3 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 2 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 1 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 5 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 4 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] link: host: 8 link: 0 is down
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 7 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 3 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 2 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 1 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 5 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 4 has no active links
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Sep 11 01:06:59 pve04 corosync[1282]: [KNET ] host: host: 8 has no active links
Sep 11 01:07:01 pve04 kernel: vfio-pci 0000:03:00.0: Masking broken INTx support
Sep 11 01:07:01 pve04 kernel: vfio-pci 0000:03:00.1: Masking broken INTx support
Sep 11 01:07:02 pve04 corosync[1282]: [TOTEM ] Token has not been received in 5175 ms
Sep 11 01:07:02 pve04 nut-monitor[1076]: Poll UPS [APC-900W-01@nut01.example.com] failed - Server disconnected
Sep 11 01:07:02 pve04 nut-monitor[1076]: Communications with UPS APC-900W-01@nut01.example.com lost
Sep 11 01:07:02 pve04 upssched[566753]: Timer daemon started
Sep 11 01:07:02 pve04 upssched[566753]: New timer: commbad_timer (300 seconds)
Sep 11 01:07:02 pve04 nut-monitor[566748]: Network UPS Tools upsmon 2.8.0
Sep 11 01:07:03 pve04 corosync[1282]: [TOTEM ] A processor failed, forming new configuration: token timed out (6900ms), waiting 8280ms for consensus.
Sep 11 01:07:04 pve04 pvestatd[1496]: got timeout
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] Sync members[1]: 6
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] Sync left[7]: 1 2 3 4 5 7 8
Sep 11 01:07:12 pve04 corosync[1282]: [TOTEM ] A new membership (6.184a) was formed. Members left: 1 2 3 4 5 7 8
Sep 11 01:07:12 pve04 corosync[1282]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4 5 7 8
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] notice: members: 6/1167
Sep 11 01:07:12 pve04 pmxcfs[1167]: [status] notice: members: 6/1167
Sep 11 01:07:12 pve04 pmxcfs[1167]: [status] notice: node lost quorum
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 11 01:07:12 pve04 corosync[1282]: [QUORUM] Members[1]: 6
Sep 11 01:07:12 pve04 corosync[1282]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: received write while not quorate - trigger resync
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: leaving CPG group
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] notice: start cluster connection
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: cpg_join failed: 14
Sep 11 01:07:12 pve04 pmxcfs[1167]: [dcdb] crit: can't initialize service
Sep 11 01:07:12 pve04 pve-ha-crm[1683]: status change slave => wait_for_quorum
Sep 11 01:07:12 pve04 pve-ha-lrm[2200]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve04/lrm_status.tmp.2200' - Permission denied
Sep 11 01:07:12 pve04 pvescheduler[566738]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Sep 11 01:07:12 pve04 pvescheduler[566737]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Then, for the rest of the night, it just keeps logging these timeouts:
Sep 11 01:07:54 pve04 kernel: libceph: mon0 (1)192.168.40.6:6789 socket closed (con state V1_BANNER)
Sep 11 01:07:54 pve04 kernel: libceph: mon0 (1)192.168.40.6:6789 socket closed (con state V1_BANNER)
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.7:6802 osd.2 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.7:6808 osd.3 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.8:6806 osd.4 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.8:6802 osd.5 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.10:6806 osd.8 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.10:6802 osd.9 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.6:6808 osd.10 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 heartbeat_check: no reply from 192.168.40.6:6804 osd.11 since back 20>
Sep 11 01:07:54 pve04 ceph-osd[1593]: 2024-09-11T01:07:54.317-0400 778a5fa006c0 -1 osd.7 9102 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.7788>
Sep 11 01:07:54 pve04 pvestatd[1496]: got timeout
Sep 11 01:07:54 pve04 pvestatd[1496]: unable to activate storage 'ISO-Templates' - directory '/mnt/pve/ISO-Templates' does not exist or is unreachable
Sep 11 01:07:54 pve04 ceph-osd[1595]: 2024-09-11T01:07:54.787-0400 766c5c8006c0 -1 osd.6 9102 heartbeat_check: no reply from 192.168.40.7:6802 osd.2 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1595]: 2024-09-11T01:07:54.787-0400 766c5c8006c0 -1 osd.6 9102 heartbeat_check: no reply from 192.168.40.7:6808 osd.3 since back 202>
Sep 11 01:07:54 pve04 ceph-osd[1595]: 2024-09-11T01:07:54.787-0400 766c5c8006c0 -1 osd.6 9102 heartbeat_check: no reply from 192.168.40.8:6806 osd.4 since back 202>
I'm not sure which logs to pull to diagnose this failure; it feels hardware related to me.
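In case it helps, this is roughly what I was planning to collect the next time it happens (the time window and NIC names are taken from the incident above; happy to grab something else if there's a better place to look):

# Journal around the backup window for the cluster/network-facing services
journalctl --since "2024-09-11 00:55" --until "2024-09-11 01:20" -u corosync -u pve-cluster -u pvestatd

# Corosync link state and cluster membership as seen from pve04
corosync-cfgtool -s
pvecm status

# Error/drop counters on the bridge uplink (the i40e ports from the log)
ip -s link show enp3s0f0np0
ethtool -S enp3s0f0np0 | grep -iE 'err|drop|disc'

# Package/kernel versions, in case this is a known kernel or i40e issue
pveversion -v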
Thanks!