3-node cluster crash, and more than 1000 emails sent in 5 minutes.

Hi, one of our small clusters crashed and I can't explain why yet, even after reading the logs.

We are running 7.3.3.

I was about to do maintenance on node 1, so I moved a VM from node 1 to node 2 (local ZFS to local ZFS).
Our HA was active, so it tried to move the VM back to node 1 right away after the task finished.

That is where something happened. Node 1 seems to have crashed. If you look at the log, it seems something made the VM bridge go from blocking to forwarding.

Right after those log lines, nothing: the server rebooted. I don't even think it got killed by HA.

Then node 3 sent around 1000 emails, trying to repeat (I think) the fence and HA recovery forever, until that node, I think, actually reset by itself.

What the hell is that? :)

Task START:

Feb 7 11:04:08 node_01 pve-ha-lrm[2516888]: Task 'UPID:node_01:0026679B:08C737FB:63E27569:qmigrate:7013:root@pam:' still active, waiting
Feb 7 11:04:09 node_01 pve-ha-lrm[2516888]: <root@pam> end task UPID:node_01:0026679B:08C737FB:63E27569:qmigrate:7013:root@pam: OK
Feb 7 11:04:17 node_01 pmxcfs[2137]: [status] notice: received log
Feb 7 11:04:17 node_01 systemd[1]: Started Session 7015 of user root.
Feb 7 11:04:17 node_01 systemd[1]: session-7015.scope: Succeeded.
Feb 7 11:04:18 node_01 systemd[1]: Started Session 7016 of user root.
Feb 7 11:04:19 node_01 qm[2536688]: <root@pam> starting task UPID:node_01:0026B53B:08C7A5CD:63E27683:qmstart:7013:root@pam:
Feb 7 11:04:19 node_01 qm[2536763]: start VM 7013: UPID:node_01:0026B53B:08C7A5CD:63E27683:qmstart:7013:root@pam:
Feb 7 11:04:19 node_01 kernel: [1473010.930448] debugfs: Directory 'zd64' with parent 'block' already present!
Feb 7 11:04:20 node_01 systemd[1]: Started 7013.scope.
Feb 7 11:04:20 node_01 systemd-udevd[2536954]: Using default interface naming scheme 'v247'.
Feb 7 11:04:20 node_01 systemd-udevd[2536954]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 kernel: [1473012.125572] device tap7013i0 entered promiscuous mode
Feb 7 11:04:21 node_01 systemd-udevd[2536954]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 systemd-udevd[2536954]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 systemd-udevd[2537045]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 7 11:04:21 node_01 systemd-udevd[2537045]: Using default interface naming scheme 'v247'.
Feb 7 11:04:21 node_01 kernel: [1473012.186340] vmbr1: port 3(fwpr7013p0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.186349] vmbr1: port 3(fwpr7013p0) entered disabled state
Feb 7 11:04:21 node_01 kernel: [1473012.186603] device fwpr7013p0 entered promiscuous mode
Feb 7 11:04:21 node_01 kernel: [1473012.187767] vmbr1: port 3(fwpr7013p0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.187770] vmbr1: port 3(fwpr7013p0) entered forwarding state
Feb 7 11:04:21 node_01 kernel: [1473012.216419] fwbr7013i0: port 1(fwln7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.216425] fwbr7013i0: port 1(fwln7013i0) entered disabled state
Feb 7 11:04:21 node_01 kernel: [1473012.216514] device fwln7013i0 entered promiscuous mode
Feb 7 11:04:21 node_01 kernel: [1473012.216587] fwbr7013i0: port 1(fwln7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.216590] fwbr7013i0: port 1(fwln7013i0) entered forwarding state
Feb 7 11:04:21 node_01 kernel: [1473012.230343] fwbr7013i0: port 2(tap7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.230348] fwbr7013i0: port 2(tap7013i0) entered disabled state
Feb 7 11:04:21 node_01 kernel: [1473012.230480] fwbr7013i0: port 2(tap7013i0) entered blocking state
Feb 7 11:04:21 node_01 kernel: [1473012.230483] fwbr7013i0: port 2(tap7013i0) entered forwarding state
NODE 1 Crash
 
@fabian can I send you the logs by private message? I have never seen anything like that. One blade crashed, and the other returned thousands of lines, trying to run HA actions, retrying its commands and sending emails in an infinite loop, until (I think) the OS crashed and that blade rebooted too, pretty much because of a bug in the HA module.
 
the blocking -> forwarding lines are normal when a VM starts, it's just part of the state transition when the network device of the guest gets connected. could you provide the full logs of both nodes (covering like 5 minutes before and after) and the tasks?
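For example, something along these lines should cover that window on each node (the timestamps below are just an illustration, adjust them to the actual incident):

# full journal for the relevant window, written to a per-node file
journalctl --since "2023-02-07 10:59:00" --until "2023-02-07 11:10:00" > "$(hostname)-incident.log"

# or only the cluster/HA related services, if the full journal is too noisy
journalctl -u pve-ha-crm -u pve-ha-lrm -u corosync -u pve-cluster --since "2023-02-07 10:59:00" --until "2023-02-07 11:10:00"

# the task logs live under /var/log/pve/tasks/ and can be zipped up as-is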
 
@fabian


I suspect that node 1 crashed during the VM transfer because of a RAM error.
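To check that theory I am going to look for machine-check / memory errors in the kernel log on node 1 (the rasdaemon tool below is only there if it is installed):

# hardware error events (MCE, EDAC) reported by the kernel since the last boot
journalctl -k | grep -i -E 'mce|machine check|edac'

# rasdaemon keeps a history of corrected/uncorrected memory errors, if installed
ras-mc-ctl --errors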

But node 3 went crazy, as you will see, and kept sending emails and logging attempts to fence and recover the VM again and again, all of that within 3-5 minutes.

My 3-node cluster then completely crashed, because for some reason node 3 rebooted too. I took a look at the core stack switch to see if it might be related, but I don't see any errors there. And I had throttling enabled in Proxmox for jobs that move data.

So what happened:

I moved a VM from node 1 to node 2, OK, but HA had a priority on node 1, so a few seconds later it moved it back from node 2 to node 1. Node 1 failed during that transfer, and from there node 3 went crazy for some reason.
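In the meantime I am thinking about setting nofailback on the HA group, so a VM does not get pulled back to its preferred node right away after a manual migration (just a sketch, the group name and priorities are placeholders):

# HA group that prefers node_01 but does not automatically fail back to it
ha-manager groupadd prefer-node1 --nodes "node_01:2,node_02:1,node_03:1" --nofailback 1

# move the existing HA resource for VM 7013 into that group
ha-manager set vm:7013 --group prefer-node1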


I hope you can find something, this was a really scary scenario.
 

Attachments

  • clusterfail.zip (552.9 KB)
relevant entry:
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [quorum] crit: quorum_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [quorum] crit: can't initialize service
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [confdb] crit: cmap_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [confdb] crit: can't initialize service
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [dcdb] crit: cpg_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [dcdb] crit: can't initialize service
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [status] crit: cpg_initialize failed: 2
Feb 7 11:13:40 cluster1-bl3 pmxcfs[3009]: [status] crit: can't initialize service


your nodes are getting fenced due to network issues.
Please provide:
/etc/network/interfaces
/etc/corosync/corosync.conf
/etc/hosts
 
also a description of how the nodes are connected, along with whatever switches are in the way.
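for context, the health of the corosync links can also be checked directly on each node, e.g.:

# quorum and membership as Proxmox sees it
pvecm status

# per-link status of the corosync knet transport (shows connected/disconnected links)
corosync-cfgtool -s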
Hi Alex,

Actually, we think we have an issue similar to this one:

https://forum.proxmox.com/threads/w...-reboot-nodes-in-an-ha-enabled-cluster.97575/

We think node 1 failed due to a RAM issue, but node 3 started acting weird. Based on your comment, at 11:13 am node 3 was back online, so this seems to have occurred during the Proxmox reboot.
The only thing I know we did is install Ceph on each node 3 days ago, but it is not in production yet.
 
so the logs on node1 are definitely incomplete (possibly not persisted to disk). the logs of node 2 and the involved tasks would still be interesting.

AFAICT:
- node 1 got fenced
- node 3 (CRM) triggered some edge case where node 1 oscillated between states fenced and unknown (likely source of the mails)
- node 1 got back up at 11:09
- node 3 went down at 11:09:55 (logs on node 3 incomplete, logs on node 2 are not posted, so no idea - could be the same cause as node 1 going down)
- node 3 comes back up at 11:13
- node 1 finally gets around to actually enabling HA again at 11:16
- node 3 takes a bit longer to go through the backlog of queued up notification mails

I see nothing there that resembles that old bug with a cluster-wide failure. The information is rather incomplete, but it looks more to me like your nodes are undersized to handle one of them going down, so recovery takes longer, since the additional load can cause further problems, including potentially taking down additional nodes down the line.

the node state oscillation in the HA logs of node 3 does look wrong though - could you provide the missing logs (node 2, tasks) and also "pveversion -v" of all three nodes? thanks!
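to collect that, something like this on each of the three nodes should be enough:

# package versions of the whole PVE stack
pveversion -v > "$(hostname)-pveversion.txt"

# current HA manager view (master, LRM state of every node, state of every resource)
ha-manager status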
 
Hi Fabian, I will try to send you the node 2 details today. Node 2 was restarted 1 hour earlier to successfully test the installed Ceph drive on it. I was about to install Ceph on node 1, so I moved 2 small MikroTik routers and a 20 GB VM, a virtual gateway that we use for tests only.
You mention the logs may not have been persisted to disk; that would make sense. I saw that my colleague built the 3 cluster nodes on a ZFS mirror (2 x SAS 300 spinning disks).
Nothing is running on that cluster except 4 virtual MikroTik routers of 128 MB each, with replication.

The issue occurred when trying to move the 20 GB disk; maybe this maxed out the SAS drives for some reason (an I/O limit), since they are not SSDs?
From node 2 we saw nothing except node 1 crashing, and node 3 too afterwards.
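If it helps, a rough way to check whether the spinning ZFS mirror is the bottleneck during a migration, and to cap the migration bandwidth (the pool name rpool, target node name and limit values below are only placeholders):

# watch per-device I/O on the pool while a migration is running
zpool iostat -v rpool 2

# cluster-wide bandwidth cap for migrations (KiB/s), set in /etc/pve/datacenter.cfg
# bwlimit: migration=51200

# or per migration, directly on the command line
qm migrate 7013 node_02 --bwlimit 51200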
 
