Hyperconverged cluster VMs become inaccessible when one node is down

lightnet-barry

I have had a strange situation occur with a new cluster I have built.

In order to update the BMC firmware I needed to cold reset each node in turn. I migrated all VMs from node A to node B, checked they were all running, confirmed services on VMs were available as expected and then powered down node A.

Immediately, all services running on the VMs became unavailable and all IP addresses were unreachable. While the VMs showed as running in the GUI, the console was unavailable via vncproxy (e.g. "VM 149 qmp command 'set_password' failed - unable to connect to VM 149 qmp socket - timeout after 51 retries").

As soon as node A was powered back up, everything returned to normal...

I have multiple separate networks, e.g. corosync, migration, data (via vmbr0), Ceph, etc., all via separate switches.

Any help appreciated
 
Does this only happen with node A or also when you take other nodes (other than A and B) offline? If you have a cluster of two nodes then this is expected behavior.
Is there any information about quorum in the system logs or any error messages around that time?
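For example, something along these lines should surface quorum and membership changes around the time of the outage (the time window below is just a placeholder):
Code:
# corosync / pmxcfs (pve-cluster) messages around the incident, filtered for quorum-related lines
journalctl -u corosync -u pve-cluster --since "2023-05-18 07:00" --until "2023-05-18 09:00" | grep -iE 'quorum|membership|link'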
 
It happens with each node in turn as I power them down: A, B & C (3-node cluster).
pvecm showed the cluster as quorate with 2 out of 3 votes (expected).
 
This is from when node C was down:
Code:
# pvecm status
Cluster information
-------------------
Name:             xxx-proxmox
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu May 18 07:58:20 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.66
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.xxx.xxx.51 (local)
0x00000002          1 172.xxx.xxx.52
 
What about your CRUSH map? Maybe the problem lies there?

This is my CRUSH map. I'd appreciate any opinions, though I'm not sure this is the issue (see next comment).

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host xxxx-proxmox1 {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 13.97235
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 3.49309
    item osd.1 weight 3.49309
    item osd.2 weight 3.49309
    item osd.3 weight 3.49309
}
host xxxx-proxmox2 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 13.97235
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 3.49309
    item osd.5 weight 3.49309
    item osd.6 weight 3.49309
    item osd.7 weight 3.49309
}
host xxxx-proxmox3 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 13.97235
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 3.49309
    item osd.9 weight 3.49309
    item osd.10 weight 3.49309
    item osd.11 weight 3.49309
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 41.91705
    alg straw2
    hash 0    # rjenkins1
    item xxxx-proxmox1 weight 13.97235
    item xxxx-proxmox2 weight 13.97235
    item xxxx-proxmox3 weight 13.97235
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
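(For reference, a text dump like the one above can be produced with the stock tools; the file names here are arbitrary:)
Code:
# fetch the compiled CRUSH map from the cluster and decompile it to text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt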
 
I think the issue is related to the status of the OSDs when I power down one of my nodes.
If I plan the reboot:
  • set nodown, noout
  • mark OSDs as out, then down
my OSDs still show as up:
Code:
root@xxxx-proxmox1:~# ceph osd status
ID  HOST            USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  xxxx-proxmox1   575G  3001G     43      392k      1        0   exists,up
 1  xxxx-proxmox1   671G  2905G     18      709k      0        0   exists,up
 2  xxxx-proxmox1   520G  3056G      4     33.5k      0        0   exists,up
 3  xxxx-proxmox1   613G  2963G     16      209k      1        0   exists,up
 4  xxxx-proxmox2   577G  2999G     10     45.5k      1        0   exists,up
 5  xxxx-proxmox2   576G  3000G     29     1188k      2        0   exists,up
 6  xxxx-proxmox2   558G  3018G     20      608k      1        0   exists,up
 7  xxxx-proxmox2   670G  2906G     10      105k      1        0   exists,up
 8  xxxx-proxmox3     0      0       0        0       0        0   exists,up
 9  xxxx-proxmox3     0      0       0        0       0        0   exists,up
10  xxxx-proxmox3     0      0       0        0       0        0   exists,up
11  xxxx-proxmox3     0      0       0        0       0        0   exists,up
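(As a cross-check, the cluster-wide OSD flags in force at that point can be listed with something like the following; with nodown set, Ceph will not mark the stopped OSDs down, which matches what I'm seeing above.)
Code:
# show which cluster-wide OSD flags (noout, nodown, ...) are currently set
ceph osd dump | grep ^flags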

If I accidentally just reboot and crash the OSDs, they do not show as up, and my Ceph cluster works as expected (i.e. degraded but functioning):
Code:
root@xxxx-proxmox1:~# ceph osd status
ID  HOST            USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  xxxx-proxmox1   575G  3001G     45      708k      6     75.2k  exists,up
 1  xxxx-proxmox1   671G  2905G     24      260k      1     1638   exists,up
 2  xxxx-proxmox1   520G  3056G     11     1468k      0        0   exists,up
 3  xxxx-proxmox1   613G  2963G     53      719k      1        0   exists,up
 4  xxxx-proxmox2   577G  2999G      8     76.7k      1        0   exists,up
 5  xxxx-proxmox2   576G  3000G     33     2082k      2     7372   exists,up
 6  xxxx-proxmox2   558G  3018G     19      240k      8      115k  exists,up
 7  xxxx-proxmox2   670G  2906G     68      708k      2     15.1k  exists,up
 8  xxxx-proxmox3   578G  2998G    211     1630k      0        0   exists
 9  xxxx-proxmox3   613G  2963G     76      780k      1        0   exists
10  xxxx-proxmox3   670G  2906G    401     2863k      0        0   exists
11  xxxx-proxmox3   520G  3056G    260     1597k      0        0   exists
 
Thanks for the input, everyone; I should just read my own documentation!

I've somehow mangled "noout nobackfill norecover" into "nodown noout" in my node reboot process!
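For the record, roughly what the planned-maintenance steps should look like (the important part being that nodown is never set, so stopped OSDs can actually be marked down):
Code:
# before powering the node down for the firmware update
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover

# ... cold reset the node, update the BMC, bring it back ...

# once the node's OSDs are back up
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout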
 