VMs are halted on node down

Kilbukas

New Member
Mar 21, 2023
Hello,

I have a 4-node cluster running PVE 7.4.16 with Ceph 17.2.6. When I try to reboot one of the nodes, the VMs are halted and I cannot reboot/start/reset/shut them down until all OSDs are up and in again, because the Ceph OSDs are degraded. It makes no difference whether the noout flag is set or not. Ceph has 3 monitors and 3 managers running (on the nodes with VMs); the 4th node runs no VMs, only Ceph.
There are no errors in syslog; only the Ceph manager log shows this:

Code:
Sep 17 17:08:41 pve ceph-mgr[1578]: 2023-09-17T17:08:41.715+0300 7f88f4bae000 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
Sep 17 17:08:41 pve ceph-mgr[1578]: 2023-09-17T17:08:41.810+0300 7f88f4bae000 -1 mgr[py] Module status has missing NOTIFY_TYPES member
Sep 17 17:08:41 pve ceph-mgr[1578]: 2023-09-17T17:08:41.892+0300 7f88f4bae000 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.106+0300 7f88f4bae000 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.443+0300 7f88f4bae000 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.583+0300 7f88f4bae000 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.816+0300 7f88f4bae000 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.968+0300 7f88f4bae000 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.140+0300 7f88f4bae000 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.208+0300 7f88f4bae000 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.416+0300 7f88f4bae000 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.484+0300 7f88f4bae000 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.556+0300 7f88f4bae000 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.788+0300 7f88f4bae000 -1 mgr[py] Module nfs has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.856+0300 7f88f4bae000 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: context.c:56: warning: mpd_setminalloc: ignoring request to set MPD_MINALLOC a second time
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.956+0300 7f88f4bae000 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.164+0300 7f88f4bae000 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.236+0300 7f88f4bae000 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.312+0300 7f88f4bae000 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.392+0300 7f88f4bae000 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.456+0300 7f88f4bae000 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.532+0300 7f88f4bae000 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
Sep 18 00:00:57 pve ceph-mgr[1578]: 2023-09-18T00:00:57.218+0300 7f88f0b48700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror (PID: 279750) UID: 0
Sep 18 00:00:57 pve ceph-mgr[1578]: 2023-09-18T00:00:57.238+0300 7f88f0b48700 -1 received signal: Hangup from (PID: 279751) UID: 0
Sep 19 00:00:57 pve ceph-mgr[1578]: 2023-09-19T00:00:57.210+0300 7f88f0b48700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror (PID: 1221089) UID: 0
Sep 19 00:00:57 pve ceph-mgr[1578]: 2023-09-19T00:00:57.226+0300 7f88f0b48700 -1 received signal: Hangup from (PID: 1221090) UID: 0



Maybe some configuration is missing in the cluster?
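For reference, the usual flags for a planned node reboot look roughly like this (a sketch; note that noout alone does not stop I/O from blocking if PGs fall below the pool's min_size):

```shell
# Before rebooting a node: keep its OSDs "in" and avoid needless rebalancing
ceph osd set noout
ceph osd set norebalance

# ... reboot the node and wait for its OSDs to come back up ...

# Afterwards, clear the flags again
ceph osd unset noout
ceph osd unset norebalance
```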
 
Cluster status shows degraded because OSDs are down.

Replicated pool.

Crush map:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve1 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 6.98639
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.74660
item osd.1 weight 1.74660
item osd.2 weight 1.74660
item osd.3 weight 1.74660
}
host pve2 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 7.40834
alg straw2
hash 0 # rjenkins1
item osd.4 weight 1.86299
item osd.5 weight 0.93149
item osd.6 weight 0.90970
item osd.7 weight 0.93149
item osd.8 weight 0.93149
item osd.9 weight 0.93149
item osd.10 weight 0.90970
}
host pve3 {
id -7 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 6.98639
alg straw2
hash 0 # rjenkins1
item osd.11 weight 1.74660
item osd.12 weight 1.74660
item osd.13 weight 1.74660
item osd.14 weight 1.74660
}
host pve4 {
id -9 # do not change unnecessarily
id -10 class ssd # do not change unnecessarily
# weight 6.98639
alg straw2
hash 0 # rjenkins1
item osd.15 weight 1.74660
item osd.16 weight 1.74660
item osd.17 weight 1.74660
item osd.18 weight 1.74660
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 28.36751
alg straw2
hash 0 # rjenkins1
item pve05 weight 6.98639
item pve06 weight 7.40834
item pve-02-prod weight 6.98639
item pve-01-prod weight 6.98639
}

# rules
rule replicated_rule {
id 0
type replicated
step take default
step chooseleaf firstn 0 type host
step emit
}
 
Please post the output of "ceph osd tree".

Your root "default" contains four hosts named pve05, pve06, pve-02-prod and pve-01-prod, but the host buckets are called pve1, pve2, pve3 and pve4. This does not match; something is wrong here.
 
Please disregard the naming; I changed the hostnames in the post.

osd tree:

Code:
ID  CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         28.36751  root default
-9          6.98639      host pve-01-prod
15    ssd   1.74660          osd.15            up   1.00000  1.00000
16    ssd   1.74660          osd.16            up   1.00000  1.00000
17    ssd   1.74660          osd.17            up   1.00000  1.00000
18    ssd   1.74660          osd.18            up   1.00000  1.00000
-7          6.98639      host pve-02-prod
11    ssd   1.74660          osd.11            up   1.00000  1.00000
12    ssd   1.74660          osd.12            up   1.00000  1.00000
13    ssd   1.74660          osd.13            up   1.00000  1.00000
14    ssd   1.74660          osd.14            up   1.00000  1.00000
-3          6.98639      host pve05
 0    ssd   1.74660          osd.0             up   1.00000  1.00000
 1    ssd   1.74660          osd.1             up   1.00000  1.00000
 2    ssd   1.74660          osd.2             up   1.00000  1.00000
 3    ssd   1.74660          osd.3             up   1.00000  1.00000
-5          7.40834      host pve06
 4    ssd   1.86299          osd.4             up   1.00000  1.00000
 5    ssd   0.93149          osd.5             up   1.00000  1.00000
 6    ssd   0.90970          osd.6             up   1.00000  1.00000
 7    ssd   0.93149          osd.7             up   1.00000  1.00000
 8    ssd   0.93149          osd.8             up   1.00000  1.00000
 9    ssd   0.93149          osd.9             up   1.00000  1.00000
10    ssd   0.90970          osd.10            up   1.00000  1.00000
 
Please do not do this.
It won't happen again :)

Output of "ceph -s":

Code:
  cluster:
    id:     c391ba66-3e41-48c6-9ceb-80006929796a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve-02-prod,pve-01-prod,pve05 (age 20h)
    mgr: pve05(active, since 20h), standbys: pve-01-prod, pve-02-prod
    osd: 19 osds: 19 up (since 20h), 19 in (since 2d)

  data:
    pools:   2 pools, 129 pgs
    objects: 789.08k objects, 3.0 TiB
    usage:   5.8 TiB used, 23 TiB / 28 TiB avail
    pgs:     129 active+clean

  io:
    client:   229 KiB/s rd, 2.4 MiB/s wr, 10 op/s rd, 144 op/s wr
 
Yes, it now shows that the cluster is healthy, but for testing purposes, if I power down one node, it halts all running VMs.
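If I/O freezes as soon as one host goes down, the pool's replication settings are worth checking: with size 3 / min_size 2, losing one of four hosts should still leave enough replicas, but size 2 / min_size 2 (or any min_size equal to size) blocks writes whenever a replica is missing. A hedged check, where <pool> stands for the actual pool name:

```shell
# Show replication settings for a pool; writes block when
# a PG has fewer than min_size replicas available
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size

# Or list all pools with their settings at once
ceph osd pool ls detail
```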
 
