VMs are halted on node down

Kilbukas

New Member
Mar 21, 2023
Hello,

I have a 4-node cluster running PVE 7.4.16 with Ceph 17.2.6. When I try to reboot one of the nodes, the VMs are halted and I cannot reboot/start/reset/shut them down until all OSDs are up and in, because the Ceph OSDs are degraded. It doesn't matter whether the noout flag is set or not. Ceph has 3 monitors and 3 managers running (on the nodes with VMs); the 4th node has no VMs running, only Ceph.
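
For reference, setting and clearing the flag around a planned reboot looks roughly like this (a minimal sketch of the standard commands, not necessarily everything I tried):

Code:
# Before rebooting the node: keep Ceph from marking its OSDs "out"
ceph osd set noout

# ... reboot the node and wait for its OSDs to come back up ...

# Afterwards: remove the flag again
ceph osd unset noout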
There are no errors in syslog; only the Ceph manager (ceph-mgr) log shows this:

Code:
Sep 17 17:08:41 pve ceph-mgr[1578]: 2023-09-17T17:08:41.715+0300 7f88f4bae000 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
Sep 17 17:08:41 pve ceph-mgr[1578]: 2023-09-17T17:08:41.810+0300 7f88f4bae000 -1 mgr[py] Module status has missing NOTIFY_TYPES member
Sep 17 17:08:41 pve ceph-mgr[1578]: 2023-09-17T17:08:41.892+0300 7f88f4bae000 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.106+0300 7f88f4bae000 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.443+0300 7f88f4bae000 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.583+0300 7f88f4bae000 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.816+0300 7f88f4bae000 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Sep 17 17:08:42 pve ceph-mgr[1578]: 2023-09-17T17:08:42.968+0300 7f88f4bae000 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.140+0300 7f88f4bae000 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.208+0300 7f88f4bae000 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.416+0300 7f88f4bae000 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.484+0300 7f88f4bae000 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.556+0300 7f88f4bae000 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.788+0300 7f88f4bae000 -1 mgr[py] Module nfs has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.856+0300 7f88f4bae000 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
Sep 17 17:08:43 pve ceph-mgr[1578]: context.c:56: warning: mpd_setminalloc: ignoring request to set MPD_MINALLOC a second time
Sep 17 17:08:43 pve ceph-mgr[1578]: 2023-09-17T17:08:43.956+0300 7f88f4bae000 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.164+0300 7f88f4bae000 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.236+0300 7f88f4bae000 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.312+0300 7f88f4bae000 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.392+0300 7f88f4bae000 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.456+0300 7f88f4bae000 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
Sep 17 17:08:44 pve ceph-mgr[1578]: 2023-09-17T17:08:44.532+0300 7f88f4bae000 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
Sep 18 00:00:57 pve ceph-mgr[1578]: 2023-09-18T00:00:57.218+0300 7f88f0b48700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror (PID: 279750) UID: 0
Sep 18 00:00:57 pve ceph-mgr[1578]: 2023-09-18T00:00:57.238+0300 7f88f0b48700 -1 received signal: Hangup from (PID: 279751) UID: 0
Sep 19 00:00:57 pve ceph-mgr[1578]: 2023-09-19T00:00:57.210+0300 7f88f0b48700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror (PID: 1221089) UID: 0
Sep 19 00:00:57 pve ceph-mgr[1578]: 2023-09-19T00:00:57.226+0300 7f88f0b48700 -1 received signal: Hangup from (PID: 1221090) UID: 0



Maybe some configuration is missing in the cluster?
 
The cluster status shows degraded because OSDs are down.

Replicated pool.

Crush map:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve1 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 6.98639
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.74660
item osd.1 weight 1.74660
item osd.2 weight 1.74660
item osd.3 weight 1.74660
}
host pve2 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 7.40834
alg straw2
hash 0 # rjenkins1
item osd.4 weight 1.86299
item osd.5 weight 0.93149
item osd.6 weight 0.90970
item osd.7 weight 0.93149
item osd.8 weight 0.93149
item osd.9 weight 0.93149
item osd.10 weight 0.90970
}
host pve3 {
id -7 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 6.98639
alg straw2
hash 0 # rjenkins1
item osd.11 weight 1.74660
item osd.12 weight 1.74660
item osd.13 weight 1.74660
item osd.14 weight 1.74660
}
host pve4 {
id -9 # do not change unnecessarily
id -10 class ssd # do not change unnecessarily
# weight 6.98639
alg straw2
hash 0 # rjenkins1
item osd.15 weight 1.74660
item osd.16 weight 1.74660
item osd.17 weight 1.74660
item osd.18 weight 1.74660
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 28.36751
alg straw2
hash 0 # rjenkins1
item pve05 weight 6.98639
item pve06 weight 7.40834
item pve-02-prod weight 6.98639
item pve-01-prod weight 6.98639
}

# rules
rule replicated_rule {
id 0
type replicated
step take default
step chooseleaf firstn 0 type host
step emit
}
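
For reference, a decompiled CRUSH map like the one above can be dumped with the following (a sketch; the file names are arbitrary):

Code:
# Export the binary CRUSH map and decompile it to plain text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
cat crushmap.txt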
 
Please post the output of "ceph osd tree".

Your root "default" contains four hosts named pve05, pve06, pve-02-prod and pve-01-prod.
There are four host buckets called pve1, pve2, pve3 and pve4. This does not match. There is something wrong here.
 
Please post the output of "ceph osd tree".

Your root "default" contains four hosts named pve05, pve06, pve-02-prod and pve-01-prod.
There are four host buckets called pve1, pve2, pve3 and pve4. This does not match. There is something wrong here.
Please disregard the naming mismatch; I changed the names when writing the post.

osd tree:

Code:
ID  CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         28.36751  root default
-9          6.98639      host pve-01-prod
15    ssd   1.74660          osd.15            up   1.00000  1.00000
16    ssd   1.74660          osd.16            up   1.00000  1.00000
17    ssd   1.74660          osd.17            up   1.00000  1.00000
18    ssd   1.74660          osd.18            up   1.00000  1.00000
-7          6.98639      host pve-02-prod
11    ssd   1.74660          osd.11            up   1.00000  1.00000
12    ssd   1.74660          osd.12            up   1.00000  1.00000
13    ssd   1.74660          osd.13            up   1.00000  1.00000
14    ssd   1.74660          osd.14            up   1.00000  1.00000
-3          6.98639      host pve05
 0    ssd   1.74660          osd.0             up   1.00000  1.00000
 1    ssd   1.74660          osd.1             up   1.00000  1.00000
 2    ssd   1.74660          osd.2             up   1.00000  1.00000
 3    ssd   1.74660          osd.3             up   1.00000  1.00000
-5          7.40834      host pve06
 4    ssd   1.86299          osd.4             up   1.00000  1.00000
 5    ssd   0.93149          osd.5             up   1.00000  1.00000
 6    ssd   0.90970          osd.6             up   1.00000  1.00000
 7    ssd   0.93149          osd.7             up   1.00000  1.00000
 8    ssd   0.93149          osd.8             up   1.00000  1.00000
 9    ssd   0.93149          osd.9             up   1.00000  1.00000
10    ssd   0.90970          osd.10            up   1.00000  1.00000
 
Please do not do this.
It won't happen again :)

Output of ceph -s:

Code:
  cluster:
    id:     c391ba66-3e41-48c6-9ceb-80006929796a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve-02-prod,pve-01-prod,pve05 (age 20h)
    mgr: pve05(active, since 20h), standbys: pve-01-prod, pve-02-prod
    osd: 19 osds: 19 up (since 20h), 19 in (since 2d)

  data:
    pools:   2 pools, 129 pgs
    objects: 789.08k objects, 3.0 TiB
    usage:   5.8 TiB used, 23 TiB / 28 TiB avail
    pgs:     129 active+clean

  io:
    client:   229 KiB/s rd, 2.4 MiB/s wr, 10 op/s rd, 144 op/s wr
 
Yes, now it shows that the cluster is OK, but for testing purposes, if I power down one node it halts all running VMs.
 
Hello, back to my issue. Today I did a test by powering down one node, and the VM halt happened again. HA moved the VMs to a working node, but the VMs halted and the Proxmox VM console was not responding.
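
While this is happening, the PVE-side state can be checked like this (a sketch; <vmid> is a placeholder for an affected VM's ID):

Code:
# HA manager view: which node owns each HA resource and its current state
ha-manager status

# Per-VM status as PVE sees it
qm status <vmid>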

Console error: (see attached screenshot)

ceph -s screenshot when the cluster is in a warn state: (see attached screenshot)
 
ceph -s screenshot when the cluster is in a warn state:
That's just a warning and should not affect running VMs nor their administration. (Except perhaps degraded network performance, depending on your network setup...)

What about the PVE cluster status regarding quorum? What does pvecm status show?
 
That's just a warning and should not affect running VMs nor their administration. (Except perhaps degraded network performance, depending on your network setup...)

What about the PVE cluster status regarding quorum? What does pvecm status show?
pvecm status (it shows 4 hosts, but pve06 doesn't have a Ceph config or running VMs; it just joined the cluster):

Code:
Cluster information
-------------------
Name:             elmclu1
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Aug 22 15:06:51 2024
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.a2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.81.50 (local)
0x00000002          1 10.10.81.60
0x00000003          1 10.10.81.20
0x00000004          1 10.10.81.10
 
Also, I tested again with 512 PGs (see attached screenshot).

The Ceph log shows this when I powered down one node:
Code:
 2024-08-22T14:38:33.409966+0300 mgr.pve05 (mgr.35494109) 11409260 : cluster [DBG] pgmap v11413349: 513 pgs: 513 active+clean; 5.5 TiB data, 11 TiB used, 10 TiB / 21 TiB avail; 2.9 MiB/s rd, 2.5 MiB/s wr, 393 op/s
2024-08-22T14:38:35.411185+0300 mgr.pve05 (mgr.35494109) 11409261 : cluster [DBG] pgmap v11413350: 513 pgs: 513 active+clean; 5.5 TiB data, 11 TiB used, 10 TiB / 21 TiB avail; 1.8 MiB/s rd, 1.7 MiB/s wr, 224 op/s
2024-08-22T14:38:37.412049+0300 mgr.pve05 (mgr.35494109) 11409262 : cluster [DBG] pgmap v11413351: 513 pgs: 513 active+clean; 5.5 TiB data, 11 TiB used, 10 TiB / 21 TiB avail; 785 KiB/s rd, 674 KiB/s wr, 118 op/s
2024-08-22T14:38:37.815758+0300 mon.pve05 (mon.2) 3671794 : cluster [INF] mon.pve05 calling monitor election
2024-08-22T14:38:37.824768+0300 mon.pve-02-prod (mon.0) 2908600 : cluster [INF] mon.pve-02-prod calling monitor election
2024-08-22T14:38:39.412732+0300 mgr.pve05 (mgr.35494109) 11409263 : cluster [DBG] pgmap v11413352: 513 pgs: 513 active+clean; 5.5 TiB data, 11 TiB used, 10 TiB / 21 TiB avail; 306 KiB/s rd, 341 KiB/s wr, 57 op/s
2024-08-22T14:38:41.413817+0300 mgr.pve05 (mgr.35494109) 11409264 : cluster [DBG] pgmap v11413353: 513 pgs: 513 active+clean; 5.5 TiB data, 11 TiB used, 10 TiB / 21 TiB avail; 116 KiB/s rd, 99 KiB/s wr, 16 op/s
2024-08-22T14:38:42.852052+0300 mon.pve-02-prod (mon.0) 2908601 : cluster [INF] mon.pve-02-prod is new leader, mons pve-02-prod,pve05 in quorum (ranks 0,2)
2024-08-22T14:38:42.875251+0300 mon.pve-02-prod (mon.0) 2908602 : cluster [DBG] monmap e11: 3 mons at {pve-01-prod=[v2:10.10.81.10:3300/0,v1:10.10.81.10:6789/0],pve-02-prod=[v2:10.10.81.20:3300/0,v1:10.10.81.20:6789/0],pve05=[v2:10.10.81.50:3300/0,v1:10.10.81.50:6789/0]} removed_ranks: {2}
2024-08-22T14:38:42.897874+0300 mon.pve-02-prod (mon.0) 2908603 : cluster [DBG] fsmap
2024-08-22T14:38:42.897908+0300 mon.pve-02-prod (mon.0) 2908604 : cluster [DBG] osdmap e17325: 12 total, 12 up, 12 in
2024-08-22T14:38:42.898672+0300 mon.pve-02-prod (mon.0) 2908605 : cluster [DBG] mgrmap e38: pve05(active, since 8M), standbys: pve-02-prod, pve-01-prod
2024-08-22T14:38:42.898922+0300 mon.pve-02-prod (mon.0) 2908606 : cluster [WRN] Health check failed: 1/3 mons down, quorum pve-02-prod,pve05 (MON_DOWN)
2024-08-22T14:38:42.899822+0300 mon.pve-02-prod (mon.0) 2908607 : cluster [DBG] osd.18 reported immediately failed by osd.12
2024-08-22T14:38:42.899858+0300 mon.pve-02-prod (mon.0) 2908608 : cluster [INF] osd.18 failed (root=default,host=pve-01-prod) (connection refused reported by osd.12)
2024-08-22T14:38:42.899906+0300 mon.pve-02-prod (mon.0) 2908609 : cluster [DBG] osd.18 reported immediately failed by osd.12
2024-08-22T14:38:42.899956+0300 mon.pve-02-prod (mon.0) 2908610 : cluster [DBG] osd.18 reported immediately failed by osd.14
2024-08-22T14:38:42.900003+0300 mon.pve-02-prod (mon.0) 2908611 : cluster [DBG] osd.18 reported immediately failed by osd.14
2024-08-22T14:38:42.900046+0300 mon.pve-02-prod (mon.0) 2908612 : cluster [DBG] osd.18 reported immediately failed by osd.14
2024-08-22T14:38:42.900104+0300 mon.pve-02-prod (mon.0) 2908613 : cluster [DBG] osd.18 reported immediately failed by osd.12
2024-08-22T14:38:42.900148+0300 mon.pve-02-prod (mon.0) 2908614 : cluster [DBG] osd.18 reported immediately failed by osd.12
2024-08-22T14:38:42.900201+0300 mon.pve-02-prod (mon.0) 2908615 : cluster [DBG] osd.18 reported immediately failed by osd.11
2024-08-22T14:38:42.900258+0300 mon.pve-02-prod (mon.0) 2908616 : cluster [DBG] osd.18 reported immediately failed by osd.13
2024-08-22T14:38:42.900303+0300 mon.pve-02-prod (mon.0) 2908617 : cluster [DBG] osd.18 reported immediately failed by osd.13
2024-08-22T14:38:42.900347+0300 mon.pve-02-prod (mon.0) 2908618 : cluster [DBG] osd.18 reported immediately failed by osd.11
2024-08-22T14:38:42.900497+0300 mon.pve-02-prod (mon.0) 2908619 : cluster [DBG] osd.18 reported immediately failed by osd.14
2024-08-22T14:38:42.900549+0300 mon.pve-02-prod (mon.0) 2908620 : cluster [DBG] osd.18 reported immediately failed by osd.13
2024-08-22T14:38:42.900593+0300 mon.pve-02-prod (mon.0) 2908621 : cluster [DBG] osd.15 reported immediately failed by osd.14

And ceph -s shows this; rebalancing is not active:
Code:
cluster:
    id:     c391ba66-3e41-48c6-9ceb-80006929796a
    health: HEALTH_WARN
            1/3 mons down, quorum pve-02-prod,pve05
            4 osds down
            1 host (4 osds) down
            Reduced data availability: 339 pgs inactive
            Degraded data redundancy: 977725/2960882 objects degraded (33.021%), 339 pgs degraded, 339 pgs undersized

  services:
    mon: 3 daemons, quorum pve-02-prod,pve05 (age 9m), out of quorum: pve-01-prod
    mgr: pve05(active, since 8M), standbys: pve-02-prod
    osd: 12 osds: 8 up (since 9m), 12 in (since 3h)

  data:
    pools:   2 pools, 513 pgs
    objects: 1.48M objects, 5.5 TiB
    usage:   11 TiB used, 10 TiB / 21 TiB avail
    pgs:     66.082% pgs not active
             977725/2960882 objects degraded (33.021%)
             339 undersized+degraded+peered
             174 active+clean
 
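For completeness, the pool settings that determine whether PGs stay active while a host is down can be inspected like this (a sketch; <pool> is a placeholder for the actual pool name):

Code:
# List pools with their size/min_size and CRUSH rule
ceph osd pool ls detail

# Or query a single pool
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size

# Show which PGs are currently stuck inactive
ceph pg dump_stuck inactive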

