all nodes got rebooted and there is no log - Cluster disaster

Hi folks,
I added a new node to my cluster today. Then I realized the new node's network configuration might have an issue preventing it from communicating with the Ceph IP ranges, so I restarted its network service with "systemctl restart networking".
After this, the disaster happened: all of my other 13 nodes got rebooted one by one, and obviously all the VMs running on them as well.

Once all nodes came back online, I saw unusual issues: one node was not able to ping the others, and one Ceph node was not up even though the OS had booted and everything seemed OK.
I suspected the new node had somehow caused all of this, so I shut it down and everything went back to normal.
I went through all the logs (syslog, daemon.log, kern.log, etc.), but there was nothing; everything looked fine and then suddenly a reboot happened.

It seems something triggered fencing on all nodes, but why and what?

I would really appreciate it if you could help me find the root cause of this disaster.
 
You need to provide more details (your pveversion, your logs, your config, etc.).
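For example, something along these lines from each node would help (log paths are the PVE defaults; adjust the grep pattern and paths to your setup):

Bash:
pveversion -v                                                    # package versions
cat /etc/pve/corosync.conf                                       # cluster configuration
cat /etc/network/interfaces                                      # network configuration
grep -iE 'corosync|pve-cluster|watchdog|fence' /var/log/syslog   # events around the reboot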
 
UPDATE:
I've found this on one of the nodes:
watchdog-mux[1153]: client watchdog expired - disable watchdog updates


Apparently something triggered the watchdog.
But how can another node restarting its network service cause this?
 


The cluster has 13 nodes (14 with the new faulty node).
5 nodes have a 10 Gb NIC and a 100 Gb NIC; these are the SSD Ceph nodes.
3 nodes with only a 10 Gb NIC are the HDD Ceph nodes.
The rest are just compute nodes.
The syslog and daemon.log for that period are attached.

The reboots happened at 19:23. You can see in the logs that after the last line at 19:23 they suddenly jump to 19:27, where the system starts to boot up.
Code:
pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.15.18-20-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-8
pve-kernel-4.15.18-20-pve: 4.15.18-46
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-55
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-6
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-40
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

SYSLOG
Code:
Jan 6 19:23:00 master2 kernel: [10305696.110336] sd 3:0:0:6: [sdh] Sense Key : Illegal Request [current]
Jan 6 19:23:00 master2 kernel: [10305696.110338] sd 3:0:0:6: [sdh] Add. Sense: Logical unit not supported
Jan 6 19:23:00 master2 kernel: [10305696.110824] sd 3:0:0:6: [sdh] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 6 19:23:00 master2 kernel: [10305696.110826] sd 3:0:0:6: [sdh] Sense Key : Illegal Request [current]
Jan 6 19:23:00 master2 kernel: [10305696.110828] sd 3:0:0:6: [sdh] Add. Sense: Logical unit not supported
Jan 6 19:23:00 master2 kernel: [10305696.111121] sd 3:0:0:6: [sdh] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 6 19:23:00 master2 kernel: [10305696.111124] sd 3:0:0:6: [sdh] Sense Key : Illegal Request [current]
Jan 6 19:23:00 master2 kernel: [10305696.111126] sd 3:0:0:6: [sdh] Add. Sense: Logical unit not supported
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Jan 6 19:27:26 master2 systemd-modules-load[456]: Inserted module 'coretemp'
Jan 6 19:27:26 master2 kernel: [ 0.000000] Linux version 4.15.18-20-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-46 (Thu, 8 Aug 2019 10:42:06 +0200) ()
Jan 6 19:27:26 master2 systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 6 19:27:26 master2 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-20-pve root=/dev/mapper/pve-root ro quiet
Jan 6 19:27:26 master2 systemd[1]: Mounted RPC Pipe File System.
Jan 6 19:27:26 master2 kernel: [ 0.000000] KERNEL supported cpus:
Jan 6 19:27:26 master2 kernel: [ 0.000000]   Intel GenuineIntel
Jan 6 19:27:26 master2 kernel: [ 0.000000]   AMD AuthenticAMD
Jan 6 19:27:26 master2 systemd[1]: Started Flush Journal to Persistent Storage.
Jan 6 19:27:26 master2 kernel: [ 0.000000]   Centaur CentaurHauls
Jan 6 19:27:26 master2 kernel: [ 0.000000] x86/fpu: x87 FPU will use FXSAVE
Jan 6 19:27:26 master2 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
Jan 6 19:27:26 master2 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable

DAEMON.LOG
Code:
Jan 6 19:21:00 master2 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 19:21:03 master2 pvesr[25575]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 19:21:04 master2 systemd[1]: Started Proxmox VE replication runner.
Jan 6 19:22:00 master2 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 19:22:01 master2 systemd[1]: Started Proxmox VE replication runner.
Jan 6 19:23:00 master2 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 19:23:25 master2 watchdog-mux[1153]: client watchdog expired - disable watchdog updates
Jan 6 19:27:26 master2 systemd-modules-load[456]: Inserted module 'coretemp'
Jan 6 19:27:26 master2 systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 6 19:27:26 master2 systemd[1]: Mounted RPC Pipe File System.
Jan 6 19:27:26 master2 systemd[1]: Started Flush Journal to Persistent Storage.
Jan 6 19:27:26 master2 systemd[1]: Started Load/Save Random Seed.
Jan 6 19:27:26 master2 systemd[1]: Started udev Coldplug all Devices.
Jan 6 19:27:26 master2 systemd[1]: Starting udev Wait for Complete Device Initialization...
Jan 6 19:27:26 master2 systemd[1]: Started Set the console keyboard layout.
Jan 6 19:27:26 master2 systemd-modules-load[456]: Inserted module 'iscsi_tcp'
 

Attachments

  • daemon.log (156.4 KB)
  • syslog.txt (289.3 KB)
  • logs.zip (50.8 KB)
Do you see anything regarding knet or corosync in the syslogs of the other nodes shortly before they rebooted? Something saying that they lost quorum or lost the connection to other nodes?

It is possible that the new node caused too much corosync (the cluster management) traffic for your network, which led to nodes getting separated from the quorate part of the cluster. If you have HA enabled, the separated nodes will fence themselves (hard reset) if they cannot rejoin the cluster within ~2 minutes.
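If it helps, something along these lines on each node should surface the relevant corosync and HA messages (log file paths assume the default PVE 5 setup):

Bash:
grep -iE 'corosync|knet|TOTEM|quorum' /var/log/syslog   # membership / quorum messages
grep -i 'watchdog' /var/log/daemon.log                  # watchdog-mux activity
ha-manager status                                       # is HA active, which resources are managed
pvecm status                                            # current cluster membership and quorum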
 

Unfortunately, there was nothing in the logs except the watchdog warning:
"watchdog-mux[1153]: client watchdog expired - disable watchdog updates"

Is there anywhere else I should check to find more logs about this?
Regarding your theory, I'm wondering how that could be possible: all nodes are connected through 2x 10 Gb NICs (bonded), and corosync traffic shouldn't be heavy enough to bring down the network for all nodes at once.
This incident was so bad that I don't dare bring up the new node again.
If the Proxmox team could help find the root cause, that would be great.
 

Can you check if your watchdog device has been enabled with "nowayout"?
That would explain why the system has been forcefully reset, even if the daemon itself disconnects gracefully.

The message itself looks like a classical watchdog timeout.

I don't know the specific details of the PVE + watchdog-mux + corosync bundle, as I am more experienced with the "normal" watchdog package.
PVE v5 uses multicast-based corosync; if you have IGMP snooping enabled on your switches, quorum can get out of sync very easily (I had the same problem with one installation). If watchdog-mux does not feed the watchdog device because quorum is missing, you only have a few seconds until your nodes forcefully reboot.

IMHO, you can upgrade to the unicast version of corosync while the cluster is running; that should solve a lot of issues with multicast groups:
https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0#Cluster:_always_upgrade_to_Corosync_3_first
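Roughly, the corosync part of that guide boils down to something like the following; this is a sketch from memory, so double-check the exact repository line and ordering against the linked wiki page before touching a production cluster:

Bash:
# 1) on every node: add the corosync-3 repository exactly as given in the wiki, then
apt update
# 2) stop the HA services on all nodes first, so nothing fences while corosync restarts
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
# 3) upgrade the corosync packages on every node
apt dist-upgrade
# 4) when all nodes run Corosync 3 and the cluster is quorate again, restart HA
systemctl start pve-ha-lrm
systemctl start pve-ha-crm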
 

Thanks for your info, it gave me some ideas.
But what do you mean by my "watchdog device"?
According to what I read, the system should be using the default Linux watchdog (softdog).
Also, there is no module configured in /etc/default/pve-ha-manager.

Do you have any idea where I can get more logs and details about these watchdogs? I believe they are the cause of this.
Regarding the multicast packets, we have checked the logs on the switches, but there was nothing unusual.
 
If you have server-grade hardware, you will have /dev/watchdog* interfaces.
They even exist on low-end Intel chipsets, and most servers integrate them with IPMI.
A software watchdog would not make much sense, as it won't fire in most failure cases (but they do exist).

Try these:
Bash:
root@vm2021 ~ # dmesg | grep -i watch
root@vm2021 ~ # fuser -v /dev/watchdog

That will give you more info about it.

Regarding the switches: you won't find an error in their logs because they don't know that it's a problem. Did you check whether IGMP snooping is enabled?
For IGMP to work, each member needs to join the multicast group, and the switches need to listen to these join requests. It works like a firewall: if the switch doesn't see you joining, you won't get the data. If you disable IGMP snooping, multicast is much more often flooded like broadcast, reaching the remaining nodes and giving a node a chance to officially rejoin.
At least this was the case for many setups that had a problem like yours. Some were unable to add new nodes to the cluster, some had quorum issues.
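If you want to verify multicast delivery between the nodes yourself, omping is the tool usually suggested for this; the hostnames below are placeholders for your node names:

Bash:
# short burst test: high packet rate, should show ~0% loss on a healthy network
omping -c 10000 -i 0.001 -F -q master1 master2 master14
# longer test (about 10 minutes) to catch IGMP snooping membership timeouts
omping -c 600 -i 1 -q master1 master2 master14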

Most cluster tools have switched from multicast to unicast, as unicast works reliably at any time with any hardware; it's the most basic thing.

I would bet that Corosync 3 fixes this problem permanently for you. You are lucky, as there is an optimized release, built for the PVE v6 upgrade, that should also work flawlessly on your v5 cluster. Upgrading to this release works fine while the whole cluster is online.
 
Thanks again for shedding light on this.
Unfortunately, I didn't set up any of the servers to rely on a hardware watchdog.
This is the output of the commands you suggested:

Code:
#dmesg | grep -i watch
# fuser -v /dev/watchdog
                     USER        PID ACCESS COMMAND
/dev/watchdog:       root       1096 F.... watchdog-mux


I remember that when I first started deploying Proxmox, I read in its KBs that IGMP snooping should be disabled to allow multicast, so it is now fully disabled on our switches.
I assume I have to go through the upgrade process to Proxmox 6; however, I doubt we can do that on the production cluster without interruption.
 
Also, I remember reading this at https://pve.proxmox.com/wiki/High_Availability:

During normal operation, ha-manager regularly resets the watchdog timer to prevent it from elapsing. If, due to a hardware fault or program error, the computer fails to reset the watchdog, the timer will elapse and triggers a reset of the whole server (reboot).

By default, all hardware watchdog modules are blocked for security reasons. They are like a loaded gun if not correctly initialized. To enable a hardware watchdog, you need to specify the module to load in /etc/default/pve-ha-manager, for example:

# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt
So I commented out everything in my /etc/default/pve-ha-manager,
because I still have no idea how to configure the hardware watchdog.
And apparently this softdog timer is so short that an undetected multicast interruption can bring the whole cluster down.
 
The Proxmox docs are correct, but I do not share their opinion.
A watchdog device works exactly like they describe, even the software one.
I mostly use the IPMI-based watchdog devices, as they support power cycle instead of reset (the hardware is turned off and back on vs. a simple reset).

You don't need to upgrade to PVE v6; the corosync upgrade is just a step in the upgrade process, and you can take a break after that step.
Upgrading corosync and upgrading PVE are two different things - PVE v6 just depends on the former, not the other way around.
 
I really appreciate the time you put into posting here; it really helped me.
As a last question, can you give me a clue whether you configure those IPMI watchdogs before using them, or do you basically just add them in /etc/default/pve-ha-manager?
 

You need to install openipmi and ipmitool using APT. Then you need to adjust this file:
/etc/default/openipmi
https://bitbucket.org/code-orange/d...mplates/config-fs/static/etc/default/openipmi

You need to blacklist all other watchdog drivers, like iTCO (you can find this on the net).
Most watchdog daemons use /dev/watchdog, which is connected to /dev/watchdog0. All others (1+) will also be detected, but most daemons don't use them. As the IPMI watchdog is better (IMHO), you want to make sure the kernel only loads that one.
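A rough sketch of that setup (package, module and file names are from memory and may differ for your hardware, so treat this as a starting point rather than a recipe):

Bash:
apt install openipmi ipmitool
# keep the other watchdog drivers from being loaded, so the IPMI one owns /dev/watchdog
cat > /etc/modprobe.d/blacklist-watchdog.conf <<'EOF'
blacklist iTCO_wdt
blacklist iTCO_vendor_support
EOF
# point pve-ha-manager at the IPMI watchdog module instead of softdog
echo 'WATCHDOG_MODULE=ipmi_watchdog' >> /etc/default/pve-ha-manager
# after a reboot, check what the kernel registered and what the BMC reports
dmesg | grep -i watchdog
ipmitool mc watchdog get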
 
We have now upgraded to Corosync version 3, so, fortunately, no more multicast.
Surprisingly, today while I was adding a new node to the cluster, the node that previously made the whole cluster reboot (mentioned in this post) got rebooted automatically.
While looking at the logs I found these. It seems that it lost quorum, but I can't understand why.
And the node became normal again after a reboot.


Jan 30 17:32:16 master14 corosync[5859]: [TOTEM ] A new membership (e.b0) was formed. Members left: 1 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:16 master14 corosync[5859]: [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:20 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:21 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:22 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:40 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:40 master14 pve-firewall[2306]: firewall update time (15.675 seconds)
Jan 30 17:32:41 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:42 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:43 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:44 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:45 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:46 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:47 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 70
Jan 30 17:32:48 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 80
Jan 30 17:32:49 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 90
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 100
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retried 100 times
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:51 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:52 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:53 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:54 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:55 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:56 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:56 master14 watchdog-mux[1337]: client watchdog expired - disable watchdog updates
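For the next attempt I'm planning to watch the corosync link state while the node joins, with something like this (as far as I understand the Corosync 3 tooling):

Bash:
corosync-cfgtool -s        # knet link status as seen from this node
corosync-quorumtool -s     # quorum state and member list
pvecm status               # Proxmox view of membership and quorum
journalctl -u corosync -f  # follow corosync messages live while the node joins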
 
