[SOLVED] cannot start ha resource when ceph in health_warn state

Discussion in 'Proxmox VE: Installation and configuration' started by rseffner, Dec 25, 2017.

  1. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Hi,

    I'm playing with Proxmox, HA and Ceph in a testing environment. I set up three Ceph-enabled Proxmox nodes, including three Ceph monitors, but with a Ceph pool that keeps two replicas and requires only one (size 2, min_size 1). I think this is the smallest HA configuration possible. (The third node is only there for quorum and a Ceph monitor; it has nearly no storage.)
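
    For reference, a pool like this one - size 2, min_size 1, 64 PGs, named 'sus-pool' as in the 'ceph osd dump' output further down - can be created roughly as sketched below; this is a reconstruction of the setup, not my exact shell history:

    Code:
    # create the RBD pool with 64 placement groups
    ceph osd pool create sus-pool 64 64
    # keep two replicas, but allow I/O with only one replica available
    ceph osd pool set sus-pool size 2
    ceph osd pool set sus-pool min_size 1
    # tag the pool for RBD use
    ceph osd pool application enable sus-pool rbd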

    If I run an HA-enabled VM on node1 and shut node1 down via the Proxmox UI, the VM is started on node2 a few minutes later.
    If I do nearly the same thing but switch node1 off hard, the UI also shows the VM running on node2 a few minutes later, but the VM is not pingable or accessible. If I power node1 back on and restart the VM, everything works again.

    So I believe Ceph being in state HEALTH_WARN keeps the HA resource from really starting. In the log files I found "start failed: command '/usr/bin/kvm -id 1...ccel=tcg'' failed: got timeout".

    Because the shutdown scenario works, I'm sure the VM can run with one node down. But the worst case is not a node shutting down, it is a node crashing. What do I have to do to get my second scenario (power loss/hard reset) working, in the sense of the VM automagically respawning on node2?

    regards
    rseffner
     
  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    What is the full error message? A 'ceph osd tree' and 'ceph osd dump' would be nice, to see the distribution and which node has what function.

    A HEALTH_WARN in Ceph does not hinder the start of VMs/CTs on other nodes through HA. It depends greatly on how the shared storage and HA are set up to have this working properly.
     
  3. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Hi Alwin (and others),

    thanks for taking the time to look at my issue.

    full error :

    task started by HA resource agent
    TASK ERROR: start failed: command '/usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=ca2cb04c-43ed-428c-adfb-6098c7036a0d' -name deb9 -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu qemu64 -m 512 -k de -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:f1d993d5b9ef' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'lsi,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:sus-pool/vm-100-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/sus-pool.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,scsi-id=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=7A:9D:7E:F4:F9:59,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'accel=tcg'' failed: got timeout


    'ceph osd tree' healthy :

    root@mox2:~# ceph osd tree
    ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
    -1 0.03879 root default
    -3 0.01939 host mox1
    0 hdd 0.01939 osd.0 up 1.00000 1.00000
    -5 0.01939 host mox2
    1 hdd 0.01939 osd.1 up 1.00000 1.00000


    'ceph osd tree' unhealthy (after powering off node "mox2" - look at the state "up") :

    root@mox1:~# ceph osd tree
    ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
    -1 0.03879 root default
    -3 0.01939 host mox1
    0 hdd 0.01939 osd.0 up 1.00000 1.00000
    -5 0.01939 host mox2
    1 hdd 0.01939 osd.1 up 1.00000 1.00000


    'ceph osd dump' healthy :

    root@mox2:~# ceph osd dump
    epoch 121
    fsid 92157505-5d27-465e-9feb-b8e71d32224b
    created 2017-12-24 01:52:17.703537
    modified 2017-12-28 09:50:18.352252
    flags sortbitwise,recovery_deletes,purged_snapdirs
    crush_version 5
    full_ratio 0.95
    backfillfull_ratio 0.9
    nearfull_ratio 0.85
    require_min_compat_client jewel
    min_compat_client jewel
    require_osd_release luminous
    pool 4 'sus-pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 84 flags hashpspool stripe_width 0 application rbd
    removed_snaps [1~3]
    max_osd 2
    osd.0 up in weight 1 up_from 109 up_thru 119 down_at 105 last_clean_interval [86,104) 192.168.99.101:6800/1553 192.168.99.101:6801/1553 192.168.99.101:6802/1553 192.168.99.101:6803/1553 exists,up b05a32d2-4b51-4f49-9a20-52a4b16e679d
    osd.1 up in weight 1 up_from 119 up_thru 119 down_at 117 last_clean_interval [101,115) 192.168.99.102:6801/1581 192.168.99.102:6802/1581 192.168.99.102:6803/1581 192.168.99.102:6804/1581 exists,up 1e198f49-4fe6-476e-87f3-c3c50411efed
    blacklist 192.168.99.102:0/3451113584 expires 2017-12-28 10:50:18.301930


    'ceph osd dump' unhealthy (after powering off node "mox2" - look at the state "up") :

    root@mox1:~# ceph osd dump
    epoch 121
    fsid 92157505-5d27-465e-9feb-b8e71d32224b
    created 2017-12-24 01:52:17.703537
    modified 2017-12-28 09:50:18.352252
    flags sortbitwise,recovery_deletes,purged_snapdirs
    crush_version 5
    full_ratio 0.95
    backfillfull_ratio 0.9
    nearfull_ratio 0.85
    require_min_compat_client jewel
    min_compat_client jewel
    require_osd_release luminous
    pool 4 'sus-pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 84 flags hashpspool stripe_width 0 application rbd
    removed_snaps [1~3]
    max_osd 2
    osd.0 up in weight 1 up_from 109 up_thru 119 down_at 105 last_clean_interval [86,104) 192.168.99.101:6800/1553 192.168.99.101:6801/1553 192.168.99.101:6802/1553 192.168.99.101:6803/1553 exists,up b05a32d2-4b51-4f49-9a20-52a4b16e679d
    osd.1 up in weight 1 up_from 119 up_thru 119 down_at 117 last_clean_interval [101,115) 192.168.99.102:6801/1581 192.168.99.102:6802/1581 192.168.99.102:6803/1581 192.168.99.102:6804/1581 exists,up 1e198f49-4fe6-476e-87f3-c3c50411efed
    blacklist 192.168.99.102:0/3451113584 expires 2017-12-28 10:50:18.301930


    VM 100 was first running on "mox2". I was able to ping and SSH into this VM. So I generated the requested 'ceph' output on "mox2". Then I switched off "mox2", and a few minutes later I generated new 'ceph' output from "mox1", at a time when the UI showed VM 100 running on "mox1". Now I'm not able to ping or SSH into this VM. I think it only looks started, but isn't.
     
  4. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    What does 'ceph -s' tell you when you have turned off mox2? Is your Ceph cluster virtual or physical?
     
  5. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    'ceph -s' output :

    cluster:
    id: 92157505-5d27-465e-9feb-b8e71d32224b
    health: HEALTH_WARN
    Reduced data availability: 32 pgs inactive
    Degraded data redundancy: 32 pgs unclean
    1/3 mons down, quorum mox1,mox3

    services:
    mon: 3 daemons, quorum mox1,mox3, out of quorum: mox2
    mgr: mox1(active), standbys: mox3
    osd: 2 osds: 2 up, 2 in

    data:
    pools: 1 pools, 64 pgs
    objects: 262 objects, 946 MB
    usage: 3036 MB used, 17342 MB / 20378 MB avail
    pgs: 20.000% pgs unknown
    32 unknown
    32 active+clean

    It's (Proxmox and Ceph) only for learning and testing purposes, so I used VMware Workstation on Windows 10 to run 3 Debian 9 guests (mox1 to mox3) running Proxmox and Ceph. So I think the right answer is: virtual.
     
  6. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    Is mox2 completely off? The OSD should be down too. Did you change anything in the crushmap or ceph.conf?

    Is the network on the workstation configured correctly? It might just be that the nested virtual network is not working on mox1, so a ping can't get through.
     
  7. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Hello again,

    I'm sure "mox2" is really off. I also never touched crushmap or ceph.conf - I'm in an really early state of experimenting with this bunch of software.
    Mox1-3 were able to ping each other - instead of one is switched off of course, then this - and only this node - is not pingable.

    A friend of mine copied my test scenario on real hardware and sees the same behavior. With two (1+1) or four (2+2) OSDs on "mox1" and "mox2" the same timeout happens. If he adds a 3rd/5th OSD to "mox" - in this case a USB stick smaller than the VM - switching off "mox2" works as expected. Do I need an odd number of OSDs? Does every Ceph monitor need an OSD?

    Best regards,
    Ronny
     
  8. Jarek

    Jarek Member

    Joined:
    Dec 16, 2016
    Messages:
    54
    Likes Received:
    7
    Are you sure that you set min_size for this pool to 1?
    Please show 'ceph health detail' while the cluster is in the health_warn state.
     
  9. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Yes.

    HEALTH_WARN Degraded data redundancy: 528/1056 objects degraded (50.000%), 64 pgs unclean, 64 pgs degraded, 64 pgs undersized; clock skew detected on mon.mox3; 1/3 mons down, quorum mox1,mox3
    PG_DEGRADED Degraded data redundancy: 528/1056 objects degraded (50.000%), 64 pgs unclean, 64 pgs degraded, 64 pgs undersized
    pg 4.0 is stuck undersized for 650.359771, current state active+undersized+degraded, last acting [0]

    this repeats with different "pg N.N" and "undersized for NNN.NNNNNN"

    MON_CLOCK_SKEW clock skew detected on mon.mox3
    mon.mox3 addr 192.168.99.103:6789/0 clock skew 9.62597s > max 0.05s (latency 0.894934s)
    MON_DOWN 1/3 mons down, quorum mox1,mox3
    mon.mox2 (rank 1) addr 192.168.99.102:6789/0 is down (out of quorum)
     
  10. Jarek

    Jarek Member

    Joined:
    Dec 16, 2016
    Messages:
    54
    Likes Received:
    7
    Code:
    ceph osd pool get [your pool name] size
    ceph osd pool get [your pool name] min_size
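
    With the pool name from earlier in this thread ('sus-pool') that would be, for example:
    Code:
    ceph osd pool get sus-pool size
    ceph osd pool get sus-pool min_size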
     
  11. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    size: 2
    min_size: 1
     
  12. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    I can reproduce this myself. It looks like Ceph doesn't mark the OSDs down. It might be some timing or config issue, as it doesn't happen if there are more than four OSDs.

    For a test, let the cluster run for a while and check if the state of the OSDs changes.
    Code:
    watch -d "ceph -s && echo && ceph osd tree"
     
  13. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    I wish you all the best for the new year.

    Sorry for my late reply; I hope someone is still reading here.
    After hours of running in the node-off state, the mentioned command still shows both Ceph OSDs as up.
     
  14. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,806
    Likes Received:
    157
    As your setup is very small, the thresholds for marking an OSD as down are effectively higher by default.
    http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#osds-report-down-osds

    This is to prevent false alarms where running OSDs are marked as down. A change should be considered carefully, as this can influence the cluster's ability to recover.

    Code:
    mon_osd_reporter_subtree_level = osd
    As described in the Ceph documentation, the subtree level is set to host (more precisely, to the common ancestor type in the CRUSH map) by default, which means two other hosts are needed to mark an OSD on one host as down.

    Code:
    mon_osd_min_down_reporters = 1
    Or you set the number of reporters needed to mark an OSD down to one (the default is two), so only one reporting subtree is needed for a down report.
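
    As a sketch only (not a tested configuration), either value would go into /etc/pve/ceph.conf, e.g. in the [global] section; the monitors usually need a restart (or the value injected at runtime) to pick it up:
    Code:
    [global]
         # count each OSD as its own reporting subtree instead of grouping per host
         mon osd reporter subtree level = osd
         # alternatively: require only one down report instead of two
         # mon osd min down reporters = 1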
     
    rseffner likes this.
  15. rseffner

    rseffner New Member

    Joined:
    Dec 25, 2017
    Messages:
    8
    Likes Received:
    0
    Thank you Alwin.
    Now I'm able to solve my issue and I understand the behavior. I was missing enough reporting hosts/OSDs, so at best the Ceph cluster runs into a report timeout and only marks a host/OSD as down after a long period of time.
     
  16. ProxCH

    ProxCH New Member

    Joined:
    Jan 5, 2019
    Messages:
    29
    Likes Received:
    0
    Hello,

    I am encountering the same issue. Here is my architecture:

    3 nodes, 3 Ceph monitors, but only 2 nodes hosting 2 OSDs each. I have the exact symptoms described above and I guess that your fix should suit me, but first I'd like to be sure that it is correct:

    -------------------------------------------------------------
    [global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 192.168.10.0/24
    fsid = a449e595-f04f-4154-b236-81e6272af761
    keyring = /etc/pve/priv/$cluster.$name.keyring
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 1
    osd pool default size = 2
    public network = 192.168.7.0/24

    [osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

    [mon.host2]
    mon addr = 192.168.7.22:6789
    mon osd reporter subtree level = osd

    [mon.host1]
    mon addr = 192.168.7.20:6789
    mon osd reporter subtree level = osd

    [mon.host3]
    mon addr = 192.168.7.21:6789
    mon osd reporter subtree level = osd

    -------------------------------------------------------------
    Or should I do it with the option mon_osd_min_down_reporters = 1 instead?

    Thanks !
     
  17. ProxCH

    ProxCH New Member

    Joined:
    Jan 5, 2019
    Messages:
    29
    Likes Received:
    0
    Answering my own question: putting mon osd reporter subtree level = osd at the [global] level made it work!
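
    For anyone following along, the relevant part of the ceph.conf posted above then looks roughly like this (a sketch; only the added last line differs from the config quoted earlier):
    Code:
    [global]
         ...
         osd pool default min size = 1
         osd pool default size = 2
         public network = 192.168.7.0/24
         mon osd reporter subtree level = osd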

    Cheers
     