3-Node Cluster Crashed and Sent More Than 1000 Emails in 5 Minutes

Hi, one of our small clusters crashed and I can't explain why yet, even after reading the logs.

We are running 7.3.3.

I was about to do maintenance on node 1, so I moved a VM to node 2 (ZFS local to ZFS local).
Our HA was active, so right after the task finished it tried to move the VM back to node 1.

That's when something happened. Node 1 seems to have crashed. If you look at the log, it seems something made the VM bridge go from blocking to forwarding.

Right after those log lines there is nothing; the server rebooted. I don't even think it got killed by HA.

Then node 3 sent around 1000 emails, apparently repeating the fence and HA recovery over and over, until that node, I think, actually reset by itself.

What the hell is that :)

Task START:

Feb 7 11:04:08 node_01 pve-ha-lrm[2516888]: Task 'UPID:node_01:0026679B:08C737FB:63E27569:qmigrate:7013:root@pam:' still active, waiting
Feb 7 11:04:09 node_01 pve-ha-lrm[2516888]: <root@pam> end task UPID:node_01:0026679B:08C737FB:63E27569:qmigrate:7013:root@pam: OK
Feb 7 11:04:17 node_01 pmxcfs[2137]: [status] notice: received log
Feb 7 11:04:17 node_01 systemd[1]: Started Session 7015 of user root.
Feb 7 11:04:17 node_01 systemd[1]: session-7015.scope: Succeeded.
Feb 7 11:04:18 node_01 systemd[1]: Started Session 7016 of user root.
Feb 7 11:04:19 node_01 qm[2536688]: <root@pam> starting task UPID:node_01:0026B53B:08C7A5CD:63E27683:qmstart:7013:root@pam:
Feb 7 11:04:19 node_01 qm[2536763]: start VM 7013: UPID:node_01:0026B53B:08C7A5CD:63E27683:qmstart:7013:root@pam:
Feb 7 11:04:19 node_01 kernel: [1473010.930448] debugfs: Directory 'zd64' with parent 'block' already present!
Feb 7 11:04:20 node_01 systemd[1]: Started 7013.scope.
Feb 7 11:04:20 node_01 systemd-udevd[2536954]: Using default interface naming scheme 'v247'.
Feb 7 11:04:20 node_01 systemd-udevd[2536954]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 kernel: [1473012.125572] device tap7013i0 entered promiscuous mode
Feb 7 11:04:21 node_01 systemd-udevd[2536954]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 systemd-udevd[2536954]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 systemd-udevd[2537045]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 systemd-udevd[2537045]: Using default interface naming scheme 'v247'.
Feb 7 11:04:21 node_01 kernel: [1473012.186340] vmbr1: port 3(fwpr7013p0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.186349] vmbr1: port 3(fwpr7013p0) entered disabled state
Feb 7 11:04:21 node_01 kernel: [1473012.186603] device fwpr7013p0 entered promiscuous mode
Feb 7 11:04:21 node_01 kernel: [1473012.187767] vmbr1: port 3(fwpr7013p0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.187770] vmbr1: port 3(fwpr7013p0) entered forwarding state
Feb 7 11:04:21 node_01 kernel: [1473012.216419] fwbr7013i0: port 1(fwln7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.216425] fwbr7013i0: port 1(fwln7013i0) entered disabled state
Feb 7 11:04:21 node_01 kernel: [1473012.216514] device fwln7013i0 entered promiscuous mode
Feb 7 11:04:21 node_01 kernel: [1473012.216587] fwbr7013i0: port 1(fwln7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.216590] fwbr7013i0: port 1(fwln7013i0) entered forwarding state
Feb 7 11:04:21 node_01 kernel: [1473012.230343] fwbr7013i0: port 2(tap7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.230348] fwbr7013i0: port 2(tap7013i0) entered disabled state
Feb 7 11:04:21 node_01 kernel: [1473012.230480] fwbr7013i0: port 2(tap7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.230483] fwbr7013i0: port 2(tap7013i0) entered forwarding state
NODE 1 Crash
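
As an aside on the HA behavior described above: when a resource belongs to an HA group that prefers node 1, a manual migration away from that node is normally reverted right away. A minimal sketch of taking the resource out of HA control before doing maintenance, assuming the "ignored" request state available in current PVE releases:

ha-manager set vm:7013 --state ignored   # stop HA from managing (and moving back) the VM
# ... migrate manually and do the node maintenance ...
ha-manager set vm:7013 --state started   # hand the VM back to HA afterwards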
 
@fabian, can I send you the logs in a PM? I have never seen anything like this. One blade crashed, and the other spat out thousands of log lines, retrying the HA fencing and its commands and sending emails in an infinite loop until, I think, the OS crashed and that blade rebooted too, pretty much because of a bug in the HA module.
 
The blocking -> forwarding lines are normal when a VM starts; it's just part of the state transition when the guest's network device gets connected. Could you provide the full logs of both nodes (covering roughly 5 minutes before and after) and the tasks?
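
For reference, a quick way to see the current state of those bridge ports outside the kernel log (standard iproute2 and sysfs, nothing Proxmox-specific):

bridge link show | grep 7013                 # lists the tap/fwpr/fwln ports and their STP state
cat /sys/class/net/fwpr7013p0/brport/state   # 3 = forwarding, 4 = blocking, 0 = disabled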
 
@fabian


I suspect that node 1 crashed during a VM transfer because of a RAM error.

But node 3 went crazy, as you will see, and kept sending emails and log entries about trying to fence and recover again and again, all within 3-5 minutes.

My 3-node cluster then completely crashed, because for some reason node 3 rebooted too. I took a look at the core stack switch to see if it might be related, but I don't see any errors there. And I had throttling enabled in Proxmox for jobs that move data.

So what happened:

I moved a VM from node 1 to 2, OK, but HA had a priority on node 1, so a few seconds later it moved it back from 2 to 1, and node 1 failed during that transfer. From there, node 3 went crazy for some reason.

Hope you can find something; this was a really scary scenario.
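
Regarding the throttling mentioned above: migration bandwidth limits are usually set in /etc/pve/datacenter.cfg (or Datacenter -> Options in the GUI). A sketch of what that line typically looks like; the 50000 KiB/s value is only an example:

bwlimit: migration=50000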
 

Attachments

  • clusterfail.zip (552.9 KB)
Relevant entry:
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [quorum] crit: quorum_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [quorum] crit: can't initialize service
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [confdb] crit: cmap_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [confdb] crit: can't initialize service
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [dcdb] crit: cpg_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [dcdb] crit: can't initialize service
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [status] crit: cpg_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [status] crit: can't initialize service


Your nodes are getting fenced due to network issues.
Please provide:
/etc/network/interfaces
/etc/corosync/corosync.conf
/etc/hosts
 
Also, a description of how the nodes are connected, along with whatever switches are in the way.
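
While gathering those files, it may also help to check corosync's own view of the cluster links and quorum; standard PVE/corosync tools, with the time window adjusted to the incident:

corosync-cfgtool -s        # link/ring status as seen by this node
pvecm status               # quorum, membership and expected votes
journalctl -u corosync -u pve-cluster --since "10:55" --until "11:20"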
Hi Alex,

Actually, we think we have an issue similar to this one:

https://forum.proxmox.com/threads/w...-reboot-nodes-in-an-ha-enabled-cluster.97575/

We think node 1 failed due to a RAM issue, but node 3 started acting weird. Based on your quoted entry, node 3 was back online at 11:13, so those errors seem to have occurred while it was rebooting after being fenced by Proxmox.
The only thing I know we changed is installing Ceph on each node 3 days ago, but it is not in production yet.
 
so the logs on node1 are definitely incomplete (possibly not persisted to disk). the logs of node 2 and the involved tasks would still be interesting.

AFAICT:
- node 1 got fenced
- node 3 (CRM) triggered some edge case where node 1 oscillated between states fenced and unknown (likely source of the mails)
- node 1 got back up at 11:09
- node 3 went down at 11:09:55 (logs on node 3 incomplete, logs on node 2 are not posted, so no idea - could be the same cause as node 1 going down)
- node 3 comes back up at 11:13
- node 1 finally gets around to actually enabling HA again at 11:16
- node 3 takes a bit longer to go through the backlog of queued up notification mails

I see nothing there that resembles that old bug with a cluster-wide failure - the information is rather incomplete, but it looks more to me like your nodes are undersized to handle one of them going down, and thus recovery takes longer since the additional load can cause further problems including potentially taking down additional nodes down the line.

the node state oscillation in the HA logs of node 3 does look wrong though - could you provide the missing logs (node 2, tasks) and also "pveversion -v" of all three nodes? thanks!
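
One way to collect what is being asked for here, assuming the standard PVE log locations and that the systemd journal still covers the incident window:

pveversion -v                                   # package versions, run on all three nodes
journalctl -u pve-ha-crm -u pve-ha-lrm -u corosync -u pve-cluster --since "10:55" --until "11:30"   # adjust date/time to the incident
ls /var/log/pve/tasks/                          # per-task logs matching the UPIDs above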
 
Hi Fabian, I will try to send you the node 2 details today. Node 2 was restarted an hour earlier to successfully test the Ceph drive installed on it. I was about to install Ceph on node 1, so I moved 2 small MikroTik routers and a 20 GB VM, a virtual gateway that we use for tests only.
You mention logs possibly not being persisted to disk; that would make sense. I saw that my colleague built that 3-node cluster on a ZFS mirror (2 rotating 300 GB SAS disks).
Nothing is running on that cluster except 4 virtual MikroTik routers of 128 MB each, with replication.

The issue occurred while moving the 20 GB disk. Maybe that maxed out the SAS drives (I/O limit) for some reason, since they are not SSDs?
From node 2 we saw nothing except node 1 crashing, and node 3 afterwards.
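
If the suspicion is that the 20 GB move saturated the mirrored SAS disks, watching the pool during a test migration should show it. A minimal sketch with standard ZFS and sysstat tools; the pool name rpool is an assumption:

zpool iostat -v -l rpool 1      # per-vdev throughput and latency, once per second
iostat -x 1                     # %util and await on the underlying SAS disks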
 
