[SOLVED] cannot start ha resource when ceph in health_warn state

Discussion in 'Proxmox VE: Installation and configuration' started by rseffner, Dec 25, 2017.

  1. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Hi,

    I'm playing with Proxmox, HA and Ceph in a testing environment. I set up three Ceph-enabled Proxmox nodes, including three Ceph monitors, but with a Ceph pool that keeps two replicas and requires only one (size 2, min_size 1). I think this is the smallest HA configuration possible. (The third node is only there for quorum and a Ceph monitor; it has nearly no storage.)
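
    For reference, a pool like this one - size 2, min_size 1, 64 PGs, named 'sus-pool' as in the 'ceph osd dump' output further down - can be created roughly as sketched below; this is a reconstruction of the setup, not my exact shell history:

    Code:
    # create the RBD pool with 64 placement groups
    ceph osd pool create sus-pool 64 64
    # keep two replicas, but allow I/O with only one replica available
    ceph osd pool set sus-pool size 2
    ceph osd pool set sus-pool min_size 1
    # tag the pool for RBD use
    ceph osd pool application enable sus-pool rbd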

    If I run an HA-enabled VM on node1 and shut node1 down via the Proxmox UI, the VM is started on node2 a few minutes later.
    If I do nearly the same thing but switch node1 off hard, the UI also shows the VM running on node2 a few minutes later, but the VM is not pingable or accessible. If I power node1 back on and restart the VM, everything works again.

    So I believe Ceph being in state HEALTH_WARN keeps the HA resource from really starting. In the log files I found "start failed: command '/usr/bin/kvm -id 1...ccel=tcg'' failed: got timeout".

    Because the shutdown scenario works, I'm sure the VM can run with one node down. But the worst case is not a node shutting down, it is a node crashing. What do I have to do to get my second scenario (power loss/hard reset) working, in the sense of the VM automagically respawning on node2?

    regards
    rseffner
     
  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    What is the full error message? A 'ceph osd tree' and 'ceph osd dump' would be nice, to see the distribution and which node has what function.

    A HEALTH_WARN in Ceph does not hinder the start of VMs/CTs on other nodes through HA. It depends greatly on how the shared storage and HA are set up to have this working properly.
     
  3. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Hi Alwin (and others),

    thanks for taking the time to look at my issue.

    full error :

    task started by HA resource agent
    TASK ERROR: start failed: command '/usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=ca2cb04c-43ed-428c-adfb-6098c7036a0d' -name deb9 -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu qemu64 -m 512 -k de -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:f1d993d5b9ef' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'lsi,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:sus-pool/vm-100-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/sus-pool.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,scsi-id=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=7A:9D:7E:F4:F9:59,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'accel=tcg'' failed: got timeout


    'ceph osd tree' healthy :

    root@mox2:~# ceph osd tree
    ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
    -1 0.03879 root default
    -3 0.01939 host mox1
    0 hdd 0.01939 osd.0 up 1.00000 1.00000
    -5 0.01939 host mox2
    1 hdd 0.01939 osd.1 up 1.00000 1.00000


    'ceph osd tree' unhealthy (after powering off node "mox2" - look at the state "up") :

    root@mox1:~# ceph osd tree
    ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
    -1 0.03879 root default
    -3 0.01939 host mox1
    0 hdd 0.01939 osd.0 up 1.00000 1.00000
    -5 0.01939 host mox2
    1 hdd 0.01939 osd.1 up 1.00000 1.00000


    'ceph osd dump' healthy :

    root@mox2:~# ceph osd dump
    epoch 121
    fsid 92157505-5d27-465e-9feb-b8e71d32224b
    created 2017-12-24 01:52:17.703537
    modified 2017-12-28 09:50:18.352252
    flags sortbitwise,recovery_deletes,purged_snapdirs
    crush_version 5
    full_ratio 0.95
    backfillfull_ratio 0.9
    nearfull_ratio 0.85
    require_min_compat_client jewel
    min_compat_client jewel
    require_osd_release luminous
    pool 4 'sus-pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 84 flags hashpspool stripe_width 0 application rbd
    removed_snaps [1~3]
    max_osd 2
    osd.0 up in weight 1 up_from 109 up_thru 119 down_at 105 last_clean_interval [86,104) 192.168.99.101:6800/1553 192.168.99.101:6801/1553 192.168.99.101:6802/1553 192.168.99.101:6803/1553 exists,up b05a32d2-4b51-4f49-9a20-52a4b16e679d
    osd.1 up in weight 1 up_from 119 up_thru 119 down_at 117 last_clean_interval [101,115) 192.168.99.102:6801/1581 192.168.99.102:6802/1581 192.168.99.102:6803/1581 192.168.99.102:6804/1581 exists,up 1e198f49-4fe6-476e-87f3-c3c50411efed
    blacklist 192.168.99.102:0/3451113584 expires 2017-12-28 10:50:18.301930


    'ceph osd dump' unhealthy (after powering off node "mox2" - look at the state "up") :

    root@mox1:~# ceph osd dump
    epoch 121
    fsid 92157505-5d27-465e-9feb-b8e71d32224b
    created 2017-12-24 01:52:17.703537
    modified 2017-12-28 09:50:18.352252
    flags sortbitwise,recovery_deletes,purged_snapdirs
    crush_version 5
    full_ratio 0.95
    backfillfull_ratio 0.9
    nearfull_ratio 0.85
    require_min_compat_client jewel
    min_compat_client jewel
    require_osd_release luminous
    pool 4 'sus-pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 84 flags hashpspool stripe_width 0 application rbd
    removed_snaps [1~3]
    max_osd 2
    osd.0 up in weight 1 up_from 109 up_thru 119 down_at 105 last_clean_interval [86,104) 192.168.99.101:6800/1553 192.168.99.101:6801/1553 192.168.99.101:6802/1553 192.168.99.101:6803/1553 exists,up b05a32d2-4b51-4f49-9a20-52a4b16e679d
    osd.1 up in weight 1 up_from 119 up_thru 119 down_at 117 last_clean_interval [101,115) 192.168.99.102:6801/1581 192.168.99.102:6802/1581 192.168.99.102:6803/1581 192.168.99.102:6804/1581 exists,up 1e198f49-4fe6-476e-87f3-c3c50411efed
    blacklist 192.168.99.102:0/3451113584 expires 2017-12-28 10:50:18.301930


    VM 100 was first running on "mox2". I was able to ping and SSH into this VM. So I generated the requested 'ceph' output on "mox2". Then I switched off "mox2", and a few minutes later I generated new 'ceph' output from "mox1", at a time when the UI showed VM 100 running on "mox1". Now I'm not able to ping or SSH into this VM. I think it only looks started, but isn't.
     
  4. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    What does 'ceph -s' tell you when you have turned off mox2? Is your Ceph cluster virtual or physical?
     
  5. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    'ceph -s' output :

    cluster:
    id: 92157505-5d27-465e-9feb-b8e71d32224b
    health: HEALTH_WARN
    Reduced data availability: 32 pgs inactive
    Degraded data redundancy: 32 pgs unclean
    1/3 mons down, quorum mox1,mox3

    services:
    mon: 3 daemons, quorum mox1,mox3, out of quorum: mox2
    mgr: mox1(active), standbys: mox3
    osd: 2 osds: 2 up, 2 in

    data:
    pools: 1 pools, 64 pgs
    objects: 262 objects, 946 MB
    usage: 3036 MB used, 17342 MB / 20378 MB avail
    pgs: 20.000% pgs unknown
    32 unknown
    32 active+clean

    It's (Proxmox and Ceph) only for learning and testing purposes, so I used VMware Workstation on Windows 10 to run 3 Debian 9 guests (mox1 to mox3) running Proxmox and Ceph. So I think the right answer is: virtual.
     
  6. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    Is mox2 completely off? The OSD should be down too. Did you change anything in the crushmap or ceph.conf?

    Is the network on the workstation configured correctly? It might just be that the nested virtual network is not working on mox1, so a ping can't get through.
     
  7. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Hello again,

    I'm sure "mox2" is really off. I also never touched crushmap or ceph.conf - I'm in an really early state of experimenting with this bunch of software.
    Mox1-3 were able to ping each other - instead of one is switched off of course, then this - and only this node - is not pingable.

    A friend of mine copied my test scenario on real hardware and sees the same behavior. With two (1+1) or four (2+2) OSDs on "mox1" and "mox2" the same timeout happens. If he adds a 3rd/5th OSD to "mox" - in this case a USB stick smaller than the VM - switching off "mox2" works as expected. Do I need an odd number of OSDs? Does every Ceph monitor need an OSD?

    Best regards,
    Ronny
     
  8. Jarek

    Jarek Member

    Joined:
    Dec 16, 2016
    Messages:
    54
    Likes Received:
    7
    Are you sure that you set min_size for this pool to 1?
    Please show 'ceph health detail' while the cluster is in the health_warn state.
     
  9. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Yes.

    HEALTH_WARN Degraded data redundancy: 528/1056 objects degraded (50.000%), 64 pgs unclean, 64 pgs degraded, 64 pgs undersized; clock skew detected on mon.mox3; 1/3 mons down, quorum mox1,mox3
    PG_DEGRADED Degraded data redundancy: 528/1056 objects degraded (50.000%), 64 pgs unclean, 64 pgs degraded, 64 pgs undersized
    pg 4.0 is stuck undersized for 650.359771, current state active+undersized+degraded, last acting [0]

    this repeats with different "pg N.N" and "undersized for NNN.NNNNNN"

    MON_CLOCK_SKEW clock skew detected on mon.mox3
    mon.mox3 addr 192.168.99.103:6789/0 clock skew 9.62597s > max 0.05s (latency 0.894934s)
    MON_DOWN 1/3 mons down, quorum mox1,mox3
    mon.mox2 (rank 1) addr 192.168.99.102:6789/0 is down (out of quorum)
     
  10. Jarek

    Jarek Member

    Joined:
    Dec 16, 2016
    Messages:
    54
    Likes Received:
    7
    Code:
    ceph osd pool get [your pool name] size
    ceph osd pool get [your pool name] min_size
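
    With the pool name from earlier in this thread ('sus-pool') that would be, for example:
    Code:
    ceph osd pool get sus-pool size
    ceph osd pool get sus-pool min_size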
     
  11. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    size: 2
    min_size: 1
     
  12. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    I can reproduce this myself. It looks like Ceph doesn't mark the OSDs down. It might be some timing or config issue, as it doesn't happen if there are more than four OSDs.

    For a test, let the cluster run for a while and check if the state of the OSDs changes.
    Code:
    watch -d "ceph -s && echo && ceph osd tree"
     
  13. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    I wish you all the best for the new year.

    Sorry for my late reply; I hope someone is still reading here.
    After hours of running in the node-off state, the mentioned command still shows both Ceph OSDs as up.
     
  14. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    As your setup is very small, the thresholds for marking an OSD as down are effectively higher by default.
    http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#osds-report-down-osds

    This is to prevent false alarms where running OSDs are marked as down. A change should be considered carefully, as this can influence the cluster's ability to recover.

    Code:
    mon_osd_reporter_subtree_level = osd
    As described in the Ceph documentation, the subtree level is set to host (more precisely, to the common ancestor type in the CRUSH map) by default, which means two other hosts are needed to mark an OSD on one host as down.

    Code:
    mon_osd_min_down_reporters = 1
    Or you set the number of reporters needed to mark an OSD down to one (the default is two), so only one reporting subtree is needed for a down report.
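
    As a sketch only (not a tested configuration), either value would go into /etc/pve/ceph.conf, e.g. in the [global] section; the monitors usually need a restart (or the value injected at runtime) to pick it up:
    Code:
    [global]
         # count each OSD as its own reporting subtree instead of grouping per host
         mon osd reporter subtree level = osd
         # alternatively: require only one down report instead of two
         # mon osd min down reporters = 1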
     
    rseffner likes this.
  15. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Thank you Alwin.
    Now I'm able to solve my issue and I understand the behavior. I was missing enough reporting hosts/OSDs, so at best the Ceph cluster runs into a report timeout and only marks a host/OSD as down after a long period of time.
     
  16. ProxCH

    ProxCH New Member

    Joined:
    Jan 5, 2019
    Messages:
    29
    Likes Received:
    0
    Hello,

    I am encountering the same issue. Here is my architecture:

    3 nodes, 3 Ceph monitors, but only 2 nodes hosting 2 OSDs each. I have the exact symptoms described above and I guess that your fix should suit me, but first I'd like to be sure that it is correct:

    -------------------------------------------------------------
    [global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 192.168.10.0/24
    fsid = a449e595-f04f-4154-b236-81e6272af761
    keyring = /etc/pve/priv/$cluster.$name.keyring
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 1
    osd pool default size = 2
    public network = 192.168.7.0/24

    [osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

    [mon.host2]
    mon addr = 192.168.7.22:6789
    mon osd reporter subtree level = osd

    [mon.host1]
    mon addr = 192.168.7.20:6789
    mon osd reporter subtree level = osd

    [mon.host3]
    mon addr = 192.168.7.21:6789
    mon osd reporter subtree level = osd

    -------------------------------------------------------------
    Or should I do it with the option mon_osd_min_down_reporters = 1 instead?

    Thanks !
     
  17. ProxCH

    ProxCH New Member

    Joined:
    Jan 5, 2019
    Messages:
    29
    Likes Received:
    0
    Answering my own question: putting mon osd reporter subtree level = osd at the [global] level made it work!
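
    For anyone following along, the relevant part of the ceph.conf posted above then looks roughly like this (a sketch; only the added last line differs from the config quoted earlier):
    Code:
    [global]
         ...
         osd pool default min size = 1
         osd pool default size = 2
         public network = 192.168.7.0/24
         mon osd reporter subtree level = osd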

    Cheers
     