Proxmox hyper-converged (Ceph) cascading failure upon single node crash/power loss

jw6677

Hello,

I seem to be having troubles with failover / redundancy that I am hoping someone in the community might be able to help me understand.

I have a four-node cluster, and I am working to ensure high availability of the VMs and containers it manages.
This is a hobby cluster in my garage, so nothing mission critical; as such, I am not averse to intentionally taking nodes offline, as in the test below.

The issue appears to be a cascade of failures!

See the below screenshot of the state of my cluster 5 mins after intentionally killing a single node (dl380g7):
[Screenshot attachment: 1614693074394.png]


For testing, I have the ceph "noout" global flag enabled, to avoid shuffling data around. This is why all osds are "in".

It appears that the failure of dl380g7 causes a handful of OSDs on other machines to fail as well, which in turn locks everything up once the pools drop below their min_size.
This then leads to an apparent crash of pvestatd, even though the service remains active.

From there, everything is a gong show of restarting services to get it all back online.
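(For context, the sort of thing I end up doing on each affected node is roughly the following; the service names are the stock Proxmox/Ceph systemd units, and <id>/<name> are placeholders for the specific daemons on that node.)

Code:
systemctl restart pvestatd          # Proxmox status daemon that appears hung
systemctl restart ceph-osd@<id>     # any OSD on that node that dropped out, by numeric id
systemctl restart ceph-mds@<name>   # the node's MDS, if CephFS stays degraded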

Can anyone shed some light on what might be happening here, and how I am best to go about debugging?
 
Those OSDs on the remaining nodes should stay alive.

How many Ceph monitors and Ceph managers do you have in the cluster?

Anything in the logs of the OSDs that gives a hint as to why they failed? /var/log/ceph/ceph-osd.X.log
You probably want to filter out any line with "rocksdb", as it is quite spammy and might drown out the actual information you are looking for.
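For example, something along these lines (substitute X with the id of one of the OSDs that failed):

Code:
grep -v rocksdb /var/log/ceph/ceph-osd.X.log | less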
 
I have three MONs; at times I've had four, but that didn't seem to make a difference.
Similarly, three managers and three MDS daemons.

Dang, it seems the OSD logs were wiped when the OSDs restarted, but I do have ceph.log, which shows the timeline of the issue.

(I've heavily filtered the attached log to drop all the spammy lines that seemed only tangentially related.)

A few items jump out at me:
Code:
2021-03-02T06:41:21.122549-0700 osd.7 (osd.7) 696 : cluster [ERR] 21.25c missing primary copy of 21:3a43c950:::100008413bf.00000000:head, will try copies on 1,24
First sign of trouble at 06:41:21.

Code:
2021-03-02T06:41:53.361979-0700 mon.rd240 (mon.0) 42736 : cluster [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-03-02T06:41:53.361995-0700 mon.rd240 (mon.0) 42737 : cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
Complaints about operations taking too long start to appear

Code:
2021-03-02T06:41:53.399466-0700 mon.rd240 (mon.0) 42739 : cluster [WRN] Replacing daemon mds.dl380g7 as rank 0 with standby daemon mds.server
2021-03-02T06:41:53.399514-0700 mon.rd240 (mon.0) 42740 : cluster [INF] MDS daemon mds.dl380g7 is removed because it is dead or otherwise unavailable.
2021-03-02T06:41:53.402578-0700 mon.rd240 (mon.0) 42741 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2021-03-02T06:41:53.402601-0700 mon.rd240 (mon.0) 42742 : cluster [DBG] fsmap cephfs:2 {0=dl380g7=up:active,1=freenas=up:active} 2 up:standby
2021-03-02T06:41:53.426781-0700 mon.rd240 (mon.0) 42743 : cluster [DBG] osdmap e493812: 29 total, 29 up, 29 in
The dl380g7 (downed node) MDS is replaced with a standby,
the filesystem goes into a degraded state,
and the OSDs are all still marked up(?)

Code:
2021-03-02T06:42:19.162500-0700 mon.rd240 (mon.0) 42766 : cluster [WRN] Health check update: 206 slow ops, oldest one blocked for 55 sec, daemons [osd.0,osd.1,osd.11,osd.12,osd.14,osd.15,osd.18,osd.2,osd.21,osd.22]... have slow ops. (SLOW_OPS)
2021-03-02T06:42:24.164565-0700 mon.rd240 (mon.0) 42767 : cluster [WRN] Health check update: 290 slow ops, oldest one blocked for 59 sec, daemons [osd.0,osd.1,osd.11,osd.12,osd.14,osd.15,osd.18,osd.2,osd.21,osd.22]... have slow ops. (SLOW_OPS)
2021-03-02T06:42:24.164596-0700 mon.rd240 (mon.0) 42768 : cluster [WRN] Replacing daemon mds.freenas as rank 1 with standby daemon mds.rd240
2021-03-02T06:42:24.164624-0700 mon.rd240 (mon.0) 42769 : cluster [INF] MDS daemon mds.freenas is removed because it is dead or otherwise unavailable.
2021-03-02T06:42:24.167631-0700 mon.rd240 (mon.0) 42770 : cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
Secondary mds.freenas (on a node that is still up) is dead?
(I've now set `ceph fs set cephfs max_mds 1` to increase availability of the MDS in case of failure, but it is unclear why this MDS became unavailable...)
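A rough way to confirm the new layout, using the standard ceph CLI (output formatting differs between releases):

Code:
ceph fs get cephfs | grep max_mds   # confirm the setting took effect
ceph fs status cephfs               # active rank(s) plus the standby daemons
ceph mds stat                       # compact summary of active and standby MDS daemons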

Code:
2021-03-02T06:48:56.066459-0700 osd.0 (osd.0) 2202 : cluster [WRN] Monitor daemon marked osd.0 down, but it is still running
2021-03-02T06:48:56.066465-0700 osd.0 (osd.0) 2203 : cluster [DBG] map e493827 wrongly marked me down at e493827
It seems like surviving OSDs are marking each other as down?
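My rough plan for digging into why healthy OSDs got marked down is to look at the heartbeat messages in the OSD logs and at the detailed health output (log paths as in the default packaging):

Code:
grep -i heartbeat_check /var/log/ceph/ceph-osd.*.log   # failed peer heartbeats on the cluster network
ceph health detail                                     # which checks are failing and which daemons are implicated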



So, this was a helpful exercise, thank you!

New questions arise which I'll be digging into, and any pointers would be greatly appreciated as well:

Why did an unrelated mds.freenas become unavailable?
--> This smells of a networking issue, but I confirmed that there isn't any obvious routing through the downed node (see the quick check sketched after these questions).

Does it make sense to run multiple MDS daemons per node, maybe one per network? (I have a cluster network on 10.xx.xx.xx and a LAN on 192.168.xx.xx.)

Does the issue recur if I kill other nodes, or is it just luck of the draw that the cluster can't handle losing this particular one?
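As the quick check for the routing question above, I'm planning to run something like this from each surviving node (the interface and peer address are placeholders for my IPoIB setup):

Code:
ip route get <peer-cluster-ip>                     # confirm the route goes straight out the IPoIB interface, not via the downed node
ping -c 3 -I <ipoib-interface> <peer-cluster-ip>   # check that the cluster-network path is actually alive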

JW
 

Attachments: filtered ceph.log

First, can you explain your network setup and maybe share the /etc/network/interfaces file? I do think that the root problem for this unexpected behavior can be found there.

Secondly, one MDS per node is enough. Only one MGR is active at any given time, and the number of active MDS daemons is limited by max_mds; the rest stay on standby.
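If you want to double-check which daemons are active and which are standing by, something like this is enough (output differs slightly between Ceph versions):

Code:
ceph mgr stat    # the active manager and the standbys
ceph mds stat    # active MDS rank(s) and the standby count
ceph fs status   # per-filesystem view of ranks and standbys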
 
Absolutely. I am running an InfiniBand cluster network via ib_ipoib for cluster communication and corosync; otherwise it is primarily just the normal LAN.

Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual
#going to vmbr0

auto eno2
iface eno2 inet manual
#going to vmbr0

auto eno3
iface eno3 inet static
    address 192.168.B.3/24
#backup static access to lan

auto eno4
iface eno4 inet manual
#going to vmbr1


auto ibp65s0
iface ibp65s0 inet static
    address 192.168.A.A/24
    mtu 65520
    pre-up modprobe ib_ipoib
    pre-up echo connected > /sys/class/net/ibp65s0/mode
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up /sbin/ip link set dev ibp65s0 txqueuelen 10000
    post-up   iptables -t nat -A POSTROUTING -s '192.168.2.0/24' -o ibp65s0 -j MASQUERADE
    post-down iptables -t nat -D POSTROUTING -s '192.168.2.0/24' -o ibp65s0 -j MASQUERADE
#Primary cluster communication

auto ibp65s0d1
iface ibp65s0d1 inet static
        address 10.x.x.0/24
        mtu 65520
        #mtu 2044
        pre-up modprobe ib_ipoib
        pre-up echo connected > /sys/class/net/ibp65s0d1/mode
        post-up echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up /sbin/ip link set dev ibp65s0d1 txqueuelen 10000
        post-up   iptables -t nat -A POSTROUTING -s '10.x.x.0/24' -o ibp65s0d1 -j MASQUERADE
        post-down iptables -t nat -D POSTROUTING -s '10.x.x.0/24' -o ibp65s0d1 -j MASQUERADE
# Corosync Communication

auto ibp65s0d1.8020
iface ibp65s0d1.8020 inet static
        address 192.168.A.A/24
        #mtu 2044
        mtu 65520
        pre-up modprobe ib_ipoib
        pre-up echo connected > /sys/class/net/ibp65s0d1.8020/mode
        post-up /sbin/ip link set dev ibp65s0d1.8020 txqueuelen 10000

auto vmbr0
iface vmbr0 inet dhcp
#statically assigned at router
    address 192.168.B.B/24
    gateway 192.168.B.1
    broadcast 192.168.10.255
    bridge-ports eno1 eno2
    bridge-stp off
    bridge-fd 0
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward

auto vmbr1
iface vmbr1 inet dhcp
#statically assigned at router
    bridge-ports eno4
    bridge-stp off
    bridge-fd 0
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
 
Update here,

I believe the issue is tracked down to the opensm subnet manager used to run the InfiniBand network.

When the active SM went offline, everything else went to hell before the new SM took over. I'm not totally sure how to resolve that yet, but that's where I appear to be at.
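For anyone hitting the same thing, my current plan is to run opensm on more than one node with different priorities so a standby subnet manager takes over quickly. A rough sketch, assuming the stock Debian opensm package and its opensm.conf options:

Code:
# On every node that should be able to act as InfiniBand subnet manager:
apt install opensm infiniband-diags
# Give each node a different sm_priority in /etc/opensm/opensm.conf (higher wins the master election), e.g.
#   sm_priority 15   on the preferred SM
#   sm_priority 10   on the standby
systemctl enable --now opensm
# Check which subnet manager is currently the master:
sminfo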
 
Are the disks for the OSDs local to the nodes, or are they located on other storage that is also connected via InfiniBand?
 
