all nodes got rebooted and there is no log - Cluster disaster

Hi folks,
I added a new node to my cluster today. I then realized that the new node's network configuration might have an issue, because it could not communicate with the Ceph IP ranges, so I restarted the network service on that node with "systemctl restart networking".
After this, disaster struck: all of my other 13 nodes rebooted one by one, and obviously all the VMs on them went down as well.

Once all nodes came back online, I saw unusual issues: one node was not able to ping the others, and one Ceph node did not come up even though the OS had booted and everything seemed OK.
I suspected the new node had somehow caused all of this, so I shut it down, and everything went back to normal.
I looked hard through all the logs (syslog, daemon.log, kern.log, etc.) but found nothing; everything seemed OK and then suddenly a reboot happened.

It seems something triggered fencing on all nodes, but what, and why?

I would really appreciate it if you could help me find the root cause of this disaster.
 
You need to provide more details (your pveversion, your logs, your config, etc).
 
UPDATE:
I've found this on one of the nodes:
watchdog-mux[1153]: client watchdog expired - disable watchdog updates


Apparently something triggered the watchdog.
But how can another node restarting its network service cause this?
 
The cluster has 13 nodes (14 with the new faulty node).
5 nodes have both 10 Gb and 100 Gb NICs; they are the SSD Ceph nodes.
3 nodes with only 10 Gb NICs are the HDD Ceph nodes.
The rest are just compute nodes.
Syslog and daemon.log excerpts for that period are attached.

The reboots happened at 19:23. You can see in the logs that after the last line at 19:23, the log suddenly jumps to 19:27, when the system starts booting.
pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.15.18-20-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-8
pve-kernel-4.15.18-20-pve: 4.15.18-46
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-55
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-6
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-40
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

SYSLOG
Jan 6 19:23:00 master2 kernel: [10305696.110336] sd 3:0:0:6: [sdh] Sense Key : Illegal Request [current]
Jan 6 19:23:00 master2 kernel: [10305696.110338] sd 3:0:0:6: [sdh] Add. Sense: Logical unit not supported
Jan 6 19:23:00 master2 kernel: [10305696.110824] sd 3:0:0:6: [sdh] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 6 19:23:00 master2 kernel: [10305696.110826] sd 3:0:0:6: [sdh] Sense Key : Illegal Request [current]
Jan 6 19:23:00 master2 kernel: [10305696.110828] sd 3:0:0:6: [sdh] Add. Sense: Logical unit not supported
Jan 6 19:23:00 master2 kernel: [10305696.111121] sd 3:0:0:6: [sdh] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 6 19:23:00 master2 kernel: [10305696.111124] sd 3:0:0:6: [sdh] Sense Key : Illegal Request [current]
Jan 6 19:23:00 master2 kernel: [10305696.111126] sd 3:0:0:6: [sdh] Add. Sense: Logical unit not supported
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Jan 6 19:27:26 master2 systemd-modules-load[456]: Inserted module 'coretemp'
Jan 6 19:27:26 master2 kernel: [ 0.000000] Linux version 4.15.18-20-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-46 (Thu, 8 Aug 2019 10:42:06 +0200) ()
Jan 6 19:27:26 master2 systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 6 19:27:26 master2 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-20-pve root=/dev/mapper/pve-root ro quiet
Jan 6 19:27:26 master2 systemd[1]: Mounted RPC Pipe File System.
Jan 6 19:27:26 master2 kernel: [ 0.000000] KERNEL supported cpus:
Jan 6 19:27:26 master2 kernel: [ 0.000000] Intel GenuineIntel
Jan 6 19:27:26 master2 kernel: [ 0.000000] AMD AuthenticAMD
Jan 6 19:27:26 master2 systemd[1]: Started Flush Journal to Persistent Storage.
Jan 6 19:27:26 master2 kernel: [ 0.000000] Centaur CentaurHauls
Jan 6 19:27:26 master2 kernel: [ 0.000000] x86/fpu: x87 FPU will use FXSAVE
Jan 6 19:27:26 master2 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
Jan 6 19:27:26 master2 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable

DAEMON.LOG
Jan 6 19:21:00 master2 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 19:21:03 master2 pvesr[25575]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 19:21:04 master2 systemd[1]: Started Proxmox VE replication runner.
Jan 6 19:22:00 master2 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 19:22:01 master2 systemd[1]: Started Proxmox VE replication runner.
Jan 6 19:23:00 master2 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 19:23:25 master2 watchdog-mux[1153]: client watchdog expired - disable watchdog updates
Jan 6 19:27:26 master2 systemd-modules-load[456]: Inserted module 'coretemp'
Jan 6 19:27:26 master2 systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 6 19:27:26 master2 systemd[1]: Mounted RPC Pipe File System.
Jan 6 19:27:26 master2 systemd[1]: Started Flush Journal to Persistent Storage.
Jan 6 19:27:26 master2 systemd[1]: Started Load/Save Random Seed.
Jan 6 19:27:26 master2 systemd[1]: Started udev Coldplug all Devices.
Jan 6 19:27:26 master2 systemd[1]: Starting udev Wait for Complete Device Initialization...
Jan 6 19:27:26 master2 systemd[1]: Started Set the console keyboard layout.
Jan 6 19:27:26 master2 systemd-modules-load[456]: Inserted module 'iscsi_tcp'
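
The jump from 19:23 to 19:27 described above can also be located mechanically on each node. The following is only a rough sketch: `find_log_gaps` is a made-up helper name, it assumes the classic "Mon D HH:MM:SS host ..." syslog layout shown in the excerpts, and it assumes the excerpt does not cross midnight.

```shell
# Print every pair of consecutive syslog lines whose timestamps are more
# than $1 seconds apart (default: 120). Log text is read from stdin.
# Hypothetical helper; assumes the time of day is the third field.
find_log_gaps() {
    awk -v limit="${1:-120}" '
        # skip lines that do not carry an HH:MM:SS stamp in field 3
        $3 !~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { next }
        {
            split($3, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (seen && now - prev > limit)
                printf "gap of %d s between %s and %s\n", now - prev, prev_ts, $3
            prev = now
            prev_ts = $3
            seen = 1
        }'
}
```

It could be run as `find_log_gaps 60 < /var/log/daemon.log` on every node, so that the last line before each gap can be compared across the cluster.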
 


Do you see anything regarding knet or corosync in the syslogs of the other nodes shortly before they rebooted? Something saying that they lost quorum or lost the connection to other nodes?

It is possible that the new node caused too much corosync (cluster management) traffic for your network, which led to nodes getting separated from the quorate part of the cluster. If you have HA enabled, the separated nodes will fence themselves (hard reset) if they cannot reconnect to the cluster within ~2min.
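
One possible way to pull such messages out of a node's syslog is sketched below. `cluster_events` is a hypothetical helper, and the keyword list is a guess at what corosync, pmxcfs and watchdog-mux typically log, not an exhaustive filter.

```shell
# Keep only log lines (read from stdin) that mention the cluster stack.
# Hypothetical helper; the keyword list is illustrative and can be extended.
cluster_events() {
    grep -iE 'corosync|knet|pmxcfs|quorum|totem|watchdog-mux'
}
```

For example, `cluster_events < /var/log/syslog` on each rebooted node. Where a persistent systemd journal is kept, `journalctl -u corosync --since "19:15" --until "19:30"` may show the same window.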
 
Unfortunately, there was nothing in the logs except the watchdog warning:
watchdog-mux[1153]: client watchdog expired - disable watchdog updates

Is there anywhere else I should check to find more logs about this?
Regarding your theory, I wonder how that could be possible: all nodes are connected through 2x 10 Gb NICs (bonded), and corosync traffic shouldn't be heavy enough to bring down the network for all nodes at once.
This incident was so bad that I don't dare bring up the new node again.
It would be great if the Proxmox team could help find the root cause.
 
Can you check whether your watchdog device has been enabled with "nowayout"?
That would explain why the system was forcefully reset even if the daemon itself disconnected gracefully.

The message itself looks like a classic watchdog timeout.

I don't know the specific details of the PVE + watchdog-mux + corosync bundle, as I am more experienced with the "normal" watchdog package.
PVE v5 uses multicast-based corosync. If you have IGMP enabled on your switch, quorum might get out of sync very easily (I had the same problem with one installation). If watchdog-mux does not feed the watchdog device because of missing quorum, you only have a few seconds until your nodes forcefully reboot.

IMHO, you can upgrade to the unicast version of corosync while the cluster is running; that should solve a lot of issues with multicast groups.
https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0#Cluster:_always_upgrade_to_Corosync_3_first
 
Thanks for the info; it gave me some ideas.
But what do you mean by my "watchdog device"?
According to what I have read, the system should be using the default Linux software watchdog (softdog).
Also, no module is set in /etc/default/pve-ha-manager.

Do you have any idea where I can get more logs and details about these watchdogs? I believe they are the cause of this.
Regarding the multicast packets, we have checked the logs on the switches, but there was nothing unusual.
 
If you have server-grade hardware, you will have /dev/watchdog* interfaces.
They exist even on low-end Intel chipsets, and most servers integrate them with IPMI.
A software watchdog would not make much sense, as it won't fire in most cases (but they exist).

Try these:
Bash:
root@vm2021 ~ # dmesg | grep -i watch
root@vm2021 ~ # fuser -v /dev/watchdog

There you will get more info about it.
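
To see which watchdog driver the kernel has actually loaded, the module list can be filtered as well. This is a small sketch: `watchdog_modules` is a made-up name, and the pattern only covers a few common driver names (softdog, modules ending in `_wdt`, ipmi_watchdog), so a miss does not prove that no watchdog driver is present.

```shell
# Reduce `lsmod` output (read from stdin) to watchdog-related module names.
# Hypothetical helper; the name pattern is an assumption, not a full list.
watchdog_modules() {
    awk 'NR > 1 && $1 ~ /softdog|_wdt$|ipmi_watchdog/ { print $1 }'
}
```

Typical use would be `lsmod | watchdog_modules`, together with `ls -l /dev/watchdog*` to list the device nodes that exist.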

Regarding the switches: you won't find an error in their logs, because they don't know that it's a problem. Did you check whether IGMP is enabled?
For IGMP to work, each member needs to join the multicast group, and the switches need to listen for these requests. It works like a firewall: if the switches don't see you joining, you won't get data. If you disable IGMP, multicast will much more often be converted to broadcast, reaching the remaining nodes and giving the node a chance to officially rejoin.
At least this was the case for many setups that had a problem like yours. Some were unable to add new nodes to the cluster; some had quorum issues.

Most cluster tools have switched from multicast to unicast, as unicast works reliably with any hardware because it is the most basic mechanism.

I would bet that corosync 3 fixes this problem permanently for you. You are in luck: there is an optimized release, built for the PVE v6 upgrade, that should also work flawlessly on your v5 cluster. Upgrading to this release works fine while the whole cluster is online.
 
Thanks again for shedding light on this.
Unfortunately, I didn't set any of the servers to rely on the hardware watchdogs.
This is the output of the commands you suggested:

Code:
# dmesg | grep -i watch
# fuser -v /dev/watchdog
                     USER        PID ACCESS COMMAND
/dev/watchdog:       root       1096 F.... watchdog-mux


I remember that when I first started deploying Proxmox, I read in its knowledge base that IGMP snooping should be disabled to allow multicast, so it is now fully disabled on our switches.
I assume I have to go through the upgrade to Proxmox 6; however, I doubt we can do that on the production cluster without interruption.
 
Again, I remember reading this at "https://pve.proxmox.com/wiki/High_Availability":

During normal operation, ha-manager regularly resets the watchdog timer to prevent it from elapsing. If, due to a hardware fault or program error, the computer fails to reset the watchdog, the timer will elapse and triggers a reset of the whole server (reboot).

By default, all hardware watchdog modules are blocked for security reasons. They are like a loaded gun if not correctly initialized. To enable a hardware watchdog, you need to specify the module to load in /etc/default/pve-ha-manager, for example:

# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt
So I commented out everything in my /etc/default/pve-ha-manager, because I still have no idea how to configure the hardware watchdog.
And apparently this softdog timeout is so short that an undetected multicast interruption can bring the whole cluster down.
 
The Proxmox docs are correct, but I do not share their opinion.
A watchdog device works exactly as they describe, even the software one.
I mostly use the IPMI-based watchdog devices, as they support a power cycle instead of a reset (the hardware is turned off and back on, versus a simple reset).

You don't need to upgrade to PVE v6. The corosync upgrade is just one step in the upgrade process, and you can pause after that step.
Upgrading corosync and upgrading PVE are two different things: PVE v6 depends on corosync 3, not the other way around.
 
I really appreciate the time you have put into posting here; your replies really helped me.
As a last question, can you give me a clue: do you configure those IPMI watchdogs before using them, or do you basically just add them in /etc/default/pve-ha-manager?
 
You need to install openipmi and ipmitool using APT. Then you need to adjust this file:
/etc/default/openipmi
https://bitbucket.org/code-orange/d...mplates/config-fs/static/etc/default/openipmi

You need to blacklist all other watchdog drivers, such as iTCO (you can find instructions for this on the net).
Most watchdog daemons use /dev/watchdog, which is connected to /dev/watchdog0. All others (1+) will also be detected, but most daemons don't use them. As the IPMI watchdog is better (IMHO), you want to make sure the kernel loads only that one.
 
We have now upgraded to Corosync version 3, so, fortunately, no more multicast.
Surprisingly, today while I was adding a new node to the cluster, the node that previously caused the whole cluster to reboot (mentioned in this post) rebooted on its own.
While looking at the logs I found the lines below. It seems that it lost quorum, but I can't understand why.
The node goes back to normal after a reboot.


Jan 30 17:32:16 master14 corosync[5859]: [TOTEM ] A new membership (e.b0) was formed. Members left: 1 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:16 master14 corosync[5859]: [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:20 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:21 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:22 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:40 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:40 master14 pve-firewall[2306]: firewall update time (15.675 seconds)
Jan 30 17:32:41 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:42 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:43 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:44 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:45 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:46 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:47 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 70
Jan 30 17:32:48 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 80
Jan 30 17:32:49 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 90
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 100
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retried 100 times
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:51 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:52 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:53 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:54 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:55 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:56 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:56 master14 watchdog-mux[1337]: client watchdog expired - disable watchdog updates
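
The first log line says that, from master14's point of view, members 1 to 13 all left, so it was alone in its membership. A rough majority calculation shows why that alone costs it quorum. This sketch assumes one vote per node and 14 nodes in total; the log does not state either, and the function names are made up for illustration.

```shell
# Votes needed for a majority in a cluster of $1 nodes, one vote each.
# (Assumption: default voting with no extra quorum device.)
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when $1 reachable votes out of $2 total form a majority.
has_quorum() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}
```

Under these assumptions `quorum_needed 14` gives 8, so `has_quorum 1 14` fails for a node that can only see itself. That matches the pmxcfs retries and the watchdog expiry that follow, although it does not explain why the node lost contact with the others in the first place.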
 
