Ceph problem when master node is out

Discussion in 'Proxmox VE: Installation and configuration' started by Konstantinos Pappas, Jan 7, 2015.

  1. Konstantinos Pappas

    Has anyone else had the same problem?
    For your information, I followed this guide:
    http://pve.proxmox.com/wiki/Ceph_Server

    It doesn't make sense to me. Could somebody from the Proxmox team please answer officially?
    Did I do something wrong, or is it a problem with the Proxmox cluster?
    I don't know what I should do, and if the problem is not on my side, how can I safely move this to production?
    Please, any help.
    Thanks
     
  2. symmcom

    If your pool replica count is 1, try changing it to 2 and see if it makes a difference. After you change the replica count the cluster will go through rebalancing; that is normal.
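    For reference, a quick way to check and change the replica count from the CLI would be something like this (just a sketch; the pool name mystorage is assumed here, adjust it to your pool):
    Code:
    ceph osd pool get mystorage size
    ceph osd pool set mystorage size 2
    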
     
  3. Konstantinos Pappas

    Hello Mr Wasin,

    please see the attached image. I think the replica count is two; the others are one.

    Untitled.png
     
  4. Konstantinos Pappas

    Mr Wasin,

    let me explain again. We have 3 nodes:
    1. master demo1 > pvecm create cluster
    2. demo2 > pvecm add demo1
    3. demo3 > pvecm add demo1

    On all nodes: pveceph install, pveceph createmon, etc.

    Case 1:
    For some reason node 3 is turned off. The Ceph storage keeps working without any problem.

    Case 2:
    Node 2 dies because of a hardware failure. The Ceph storage still works without problems, meaning I can still access the VMs.

    Case 3:
    Let's say the master node 1 catches fire, gets destroyed, whatever you want. Then the Ceph storage is no longer accessible, meaning that from node2 or node3 I cannot see my data, VMs, snapshots, etc.

    And my question is:

    did I do something wrong, or is it a Proxmox problem?

    thanks ;-)
     
  5. udo

    Hi Konstantinos,
    in a PVE cluster all nodes are masters; it only depends on which nodes have quorum and which do not.

    Let's see what's wrong with your installation.

    In https://forum.proxmox.com/threads/20700-Ceph-problem-when-master-node-is-out?p=105653#post105653 you posted your crushtable (unfortunately without line breaks; please use https and "Settings -> General Settings -> Miscellaneous Options -> Standard Editor", which preserves your line breaks!).
    The weight of the OSDs is 0, and that can't be right.
    A good practice is to use the size in TB as the weight, so that smaller disks are not overloaded if you mix disks of different sizes.

    To see the size in TB you can use this command:
    Code:
     df -k | grep osd | awk '{print $2/(1024^3) }'
    
    For me (with 4 TB disks and ext4 format) the output is:
    Code:
    3.58062
    3.58062
    3.58062
    3.58062
    ...
    
    In this case you would run:
    Code:
    ceph osd crush set 0 3.58 root=default host=demo1
    ceph osd crush set 1 3.58 root=default host=demo1
    ceph osd crush set 2 3.58 root=default host=demo1
    ceph osd crush set 3 3.58 root=default host=demo1
    
    ceph osd crush set 4 3.58 root=default host=demo2
    ceph osd crush set 5 3.58 root=default host=demo2
    ceph osd crush set 6 3.58 root=default host=demo2
    ceph osd crush set 7 3.58 root=default host=demo2
    
    ceph osd crush set 8 3.58 root=default host=demo3
    ceph osd crush set 9 3.58 root=default host=demo3
    ceph osd crush set 10 3.58 root=default host=demo3
    ceph osd crush set 11 3.58 root=default host=demo3
    
    Your screenshot shows about 3 GB at roughly 1% usage? That would mean you have 300 GB disks; in that case the weight would be something like 0.28.
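    If that is the case, a small loop could set all twelve weights in one go (just a sketch, assuming 300 GB disks on every OSD; adjust the value if the sizes differ):
    Code:
    for i in $(seq 0 11); do
        ceph osd crush reweight osd.$i 0.28
    done
    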

    But your stuck-VM issue sounds a little different.

    Can you run this on demo2?
    Code:
    rados -p mystorage bench 60 write
    
    If this works, does it still work when demo1 is down?
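    If you also want to test reads from demo2, a variant would be to keep the benchmark objects around and then run a sequential read pass (just a sketch; remember to delete the benchmark objects afterwards):
    Code:
    rados -p mystorage bench 60 write --no-cleanup
    rados -p mystorage bench 60 seq
    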

    Udo
     
  6. udo

    Hi Karl,
    no, it's not dangerous, it just doesn't help. Even with 4 mons only one may die! Quorum is mons/2 + 1, and Ceph doesn't know anything about network partitions... I think you mean the PVE quorum.

    Udo
     
  7. Konstantinos Pappas

    udo, thanks a lot for the useful information.
    After some deeper investigation, here are my conclusions; I hope they help other people here.

    1. The problem does not come from the Ceph storage itself.
    2. The quorum belongs to the server that created the cluster, i.e. the one where I ran pvecm create cluster.
    3. The other nodes that were added with pvecm add node1 etc. do not hold the quorum; that is why the cluster and the Ceph storage keep working when demo2 or demo3 is shut down, while everything breaks when demo1 (node1) is shut down.
    4. So it seems necessary to create a shared disk to split the quorum across the network and the nodes.

    My questions are:

    Can I create a shared folder on LVM storage and move the quorum there for when node1 is down?
    There is an example using an iSCSI target at https://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster,
    but I am not familiar with that.
    Also, what changes would need to be made in cluster.conf?

    Any help is appreciated.
     
  8. symmcom

    This may be the reason, because all Ceph configuration and keyrings are on /etc/pve, which is a cluster file system. When Proxmox loses quorum that file system becomes unavailable, rendering the Ceph configuration inaccessible.
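    To check whether that is what happens, you could look at the quorum state and whether the cluster file system still serves the Ceph config and keyring while demo1 is off (just a sketch, assuming the default pveceph paths):
    Code:
    pvecm status
    ls -l /etc/pve/ceph.conf /etc/pve/priv/ceph.client.admin.keyring
    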

    But... didn't you already add both demo2 and demo3 to the Proxmox cluster? From the Proxmox GUI, don't you see all 3 Proxmox nodes?
     
  9. Konstantinos Pappas

    Mr Wasin,

    hello again. Of course I added them: 3 nodes, demo1, demo2, demo3.
    pvecm create cluster on demo1, then pvecm add demo1 on demo2 and demo3, etc.

    Would it be possible for you to build the same demo with three nodes and Ceph and verify whether you get the same results when node1 (demo1) is down?

    I think that would help a lot of people, and it needs to be known before moving to production.

    Or could someone else test the same 3-node cluster with Ceph?
     
  10. udo

    Hi Konstantinos,
    does 1. mean that "rados -p mystorage bench 60 write" also works when demo1 is down?
    You have mixed something up there.
    PVE with Ceph on the same nodes has two different quorums (independent, but with influence on each other).
    First the PVE cluster; here you can check with
    Code:
    pvecm nodes
    pvecm status
    
    If you have quorum, /etc/pve is writable, e.g. you can do something like "touch /etc/pve/xx; rm /etc/pve/xx".

    The quorum in the Ceph cluster can be checked with
    Code:
    ceph health detail
    ceph -s
    
    Forget the shared quorum disk with three or more nodes! 3 (or more) nodes are fine for a PVE cluster.

    If 1. is OK, then I assume that something is wrong with the storage.cfg entry for your Ceph pool. E.g. if demo1 is the only accessible mon, the VMs can't reach their disks when demo1 is down!
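    By the way, a quick way to see which mons are currently in the monitor quorum (just a sketch) is:
    Code:
    ceph mon stat
    ceph quorum_status -f json-pretty
    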

    Please post the output of the following commands:
    Code:
    # on node demo1,2,3
    netstat -na | grep 6789
    
    # only from one node
    grep -A 5 rbd: /etc/pve/storage.cfg
    
    # also only from one node
    cat /etc/ceph/ceph.conf
    
    Udo
     
  11. udo

    Hi,
    yes, I have done that for testing purposes (my own setup uses an independent Ceph cluster). It works without trouble!

    Udo
     
  12. Konstantinos Pappas

    Hello udo, I will post the details tomorrow.

    The problem shows up when node1 (demo1) is down.
    Let me explain.
    Prepare a cluster of, let's say, 4 or 5 or 6 nodes, whatever;
    say 6 nodes in total.

    If node2, node3, etc. goes down for any reason, everything keeps working.
    But as soon as node1 goes down for some reason, the problems start.
    I am trying to understand how Proxmox works.
    For your information, I also made a test demo with 10 nodes:
    when node1 is shut down, the problem starts.

    So is the conclusion that node1 must never be shut down?
     
  13. udo

    Hi,
    I understand your issue, but your conclusion is wrong!

    As I wrote before:
    all nodes in a PVE cluster (should) have the same number of votes for the quorum. So it makes no difference whether you shut down the first node or any other!

    The same holds for a Ceph cluster: you can shut down any mon, as long as enough mons remain alive to hold the quorum.

    The same goes for the OSD nodes (which are all equal in the PVE+Ceph scenario): if one OSD node is down (no matter which one), the data is served by the remaining OSD hosts.
    One possible design failure would be a crushmap with the failure domain on osd instead of host (then all copies can land on different OSDs that nevertheless sit on one host!). But this is not your issue, because your earlier ceph -s output showed 50% degraded, and the remaining 50% are enough.
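    A way to double-check which failure domain the ruleset actually uses (just a sketch) is to dump the crush rules and look at the chooseleaf step:
    Code:
    ceph osd crush rule dump
    # the replicated rule should contain "op": "chooseleaf_firstn" with "type": "host"
    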

    As I wrote before, I assume the issue is in how the remaining nodes reach the mons; please provide the output from my previous post.

    Udo
     
  14. symmcom

    I had a suspicion of something like this, which is why I asked to confirm storage.cfg. I think he already added the other 2 mons in storage.cfg.


    @Konstantinos
    I know that a 3-node Proxmox+Ceph setup works just fine; I used a 3-node cluster for months before scaling out. You have been asked to post the crushmap twice. Could you please post it in an easily readable format? Just take a screenshot of the crushmap from the Proxmox GUI. Please post screenshots of the following:
    1. Crushmap (Proxmox GUI)
    2. storage.cfg (CLI)
    3. Pool list (Proxmox GUI)
    4. ceph.conf (CLI)

    Hide any information you don't want public.
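    If the GUI view cuts the crushmap off, you could also dump the full map as plain text from the CLI (just a sketch; the file names are arbitrary):
    Code:
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    cat crushmap.txt
    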
     
  15. Konstantinos Pappas

    Hello udo and Wasin,
    below are the outputs of the commands you asked for; I also attach two images, one of the crushmap and one of the pools.
    ////////////////////////////////////////////////////////////

    # demo1 - node1
    netstat -na | grep 6789
    root@demo1:~# netstat -na | grep 6789
    tcp 0 0 192.168.1.201:6789 0.0.0.0:* LISTEN
    tcp 0 9 192.168.1.201:6789 192.168.1.202:57155 ESTABLISHED
    tcp 0 0 192.168.1.201:6789 192.168.1.203:38122 ESTABLISHED
    tcp 0 0 192.168.1.201:52970 192.168.1.202:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:52977 192.168.1.202:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:52961 192.168.1.202:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:60426 192.168.1.203:6789 ESTABLISHED
    tcp 0 0 192.168.1.201:52982 192.168.1.202:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:37704 192.168.1.201:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:6789 192.168.1.202:57278 ESTABLISHED
    tcp 0 0 192.168.1.201:37681 192.168.1.201:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:52998 192.168.1.202:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:52991 192.168.1.202:6789 TIME_WAIT
    tcp 0 9 192.168.1.201:6789 192.168.1.203:38101 ESTABLISHED
    tcp 0 0 192.168.1.201:60417 192.168.1.203:6789 ESTABLISHED
    tcp 0 0 192.168.1.201:52980 192.168.1.202:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:32780 192.168.1.203:6789 TIME_WAIT
    tcp 0 0 192.168.1.201:37131 192.168.1.201:6789 ESTABLISHED
    tcp 0 0 192.168.1.201:52400 192.168.1.202:6789 ESTABLISHED
    tcp 0 0 192.168.1.201:6789 192.168.1.201:37131 ESTABLISHED
    tcp 0 0 192.168.1.201:6789 192.168.1.203:38169 ESTABLISHED

    # demo2 - node2
    netstat -na | grep 6789

    root@demo2:~# netstat -na | grep 6789
    tcp 0 0 192.168.1.202:6789 0.0.0.0:* LISTEN
    tcp 0 0 192.168.1.202:57155 192.168.1.201:6789 ESTABLISHED
    tcp 0 0 192.168.1.202:6789 192.168.1.203:59066 ESTABLISHED
    tcp 0 0 192.168.1.202:6789 192.168.1.203:58837 ESTABLISHED
    tcp 0 0 192.168.1.202:57278 192.168.1.201:6789 ESTABLISHED
    tcp 0 0 192.168.1.202:6789 192.168.1.203:59106 ESTABLISHED
    tcp 0 0 192.168.1.202:40888 192.168.1.203:6789 ESTABLISHED
    tcp 0 0 192.168.1.202:40931 192.168.1.203:6789 ESTABLISHED
    tcp 0 0 192.168.1.202:6789 192.168.1.201:52400 ESTABLISHED
    tcp 0 0 192.168.1.202:40961 192.168.1.203:6789 ESTABLISHED

    # demo3 - node3
    netstat -na | grep 6789

    root@demo3:~# netstat -na | grep 6789
    tcp 0 0 192.168.1.203:6789 0.0.0.0:* LISTEN
    tcp 0 0 192.168.1.203:59106 192.168.1.202:6789 ESTABLISHED
    tcp 0 0 192.168.1.203:6789 192.168.1.202:40961 ESTABLISHED
    tcp 0 0 192.168.1.203:38101 192.168.1.201:6789 ESTABLISHED
    tcp 0 0 192.168.1.203:59066 192.168.1.202:6789 ESTABLISHED
    tcp 0 0 192.168.1.203:38122 192.168.1.201:6789 ESTABLISHED
    tcp 0 0 192.168.1.203:6789 192.168.1.202:40931 ESTABLISHED
    tcp 0 0 192.168.1.203:58837 192.168.1.202:6789 ESTABLISHED
    tcp 0 0 192.168.1.203:6789 192.168.1.201:60426 ESTABLISHED
    tcp 0 0 192.168.1.203:6789 192.168.1.202:40888 ESTABLISHED
    tcp 0 0 192.168.1.203:38169 192.168.1.201:6789 ESTABLISHED
    tcp 0 0 192.168.1.203:6789 192.168.1.201:60417 ESTABLISHED

    # only from one node
    grep -A 5 rbd: /etc/pve/storage.cfg

    root@demo1:~# grep -A 5 rbd: /etc/pve/storage.cfg
    rbd: storage
         monhost 192.168.1.201;192.168.1.202;192.168.1.203
         pool storage
         content images
         nodes demo3,demo2,demo1
         username admin

    # also only from one node
    cat /etc/ceph/ceph.conf

    root@demo1:~# cat /etc/ceph/ceph.conf
    [global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    auth supported = cephx
    cluster network = 192.168.1.0/24
    filestore xattr use omap = true
    fsid = 30d1a422-ba23-4191-9355-1d8609475f3f
    keyring = /etc/pve/priv/$cluster.$name.keyring
    osd journal size = 5120
    osd pool default min size = 1
    public network = 192.168.1.0/24

    [osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

    [mon.0]
    host = demo1
    mon addr = 192.168.1.201:6789

    [mon.1]
    host = demo2
    mon addr = 192.168.1.202:6789

    [mon.2]
    host = demo3
    mon addr = 192.168.1.203:6789
     
  16. Konstantinos Pappas

    Here are the images:
    crushmap

    crushmap.png

    pool

    pool.png
     
  17. udo

    Hi,
    fine!
    this doesn't fit: in your screenshot the pool/storage name is mystorage, and here it is storage??

    Does it change anything if you use this monhost line (spaces instead of semicolons):
    Code:
     	monhost 192.168.1.201 192.168.1.202 192.168.1.203
    
    And you still haven't answered whether "rados bench" works with demo1 down or not.
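    A related check, in case the clients simply cannot reach the other mons: you can point the ceph tool at a single monitor to test it directly (just a sketch, using the IPs from your netstat output):
    Code:
    ceph -m 192.168.1.202:6789 -s
    ceph -m 192.168.1.203:6789 -s
    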

    The screenshot of the crushmap only shows the beginning.
    Can you run:
    Code:
    ceph osd crush dump -f json-pretty
    
    Udo
     
  18. Konstantinos Pappas

    Hello udo, thanks a lot mate for the help.
    I did a fresh installation, so the Ceph pool name changed from mystorage to storage.

    root@demo1:~# ceph osd crush dump -f json-pretty

    { "devices": [
    { "id": 0,
    "name": "osd.0"},
    { "id": 1,
    "name": "osd.1"},
    { "id": 2,
    "name": "osd.2"},
    { "id": 3,
    "name": "osd.3"},
    { "id": 4,
    "name": "osd.4"},
    { "id": 5,
    "name": "osd.5"},
    { "id": 6,
    "name": "osd.6"},
    { "id": 7,
    "name": "osd.7"},
    { "id": 8,
    "name": "osd.8"},
    { "id": 9,
    "name": "osd.9"},
    { "id": 10,
    "name": "osd.10"},
    { "id": 11,
    "name": "osd.11"}],
    "types": [
    { "type_id": 0,
    "name": "osd"},
    { "type_id": 1,
    "name": "host"},
    { "type_id": 2,
    "name": "chassis"},
    { "type_id": 3,
    "name": "rack"},
    { "type_id": 4,
    "name": "row"},
    { "type_id": 5,
    "name": "pdu"},
    { "type_id": 6,
    "name": "pod"},
    { "type_id": 7,
    "name": "room"},
    { "type_id": 8,
    "name": "datacenter"},
    { "type_id": 9,
    "name": "region"},
    { "type_id": 10,
    "name": "root"}],
    "buckets": [
    { "id": -1,
    "name": "default",
    "type_id": 10,
    "type_name": "root",
    "weight": 0,
    "alg": "straw",
    "hash": "rjenkins1",
    "items": [
    { "id": -2,
    "weight": 0,
    "pos": 0},
    { "id": -3,
    "weight": 0,
    "pos": 1},
    { "id": -4,
    "weight": 0,
    "pos": 2}]},
    { "id": -2,
    "name": "demo1",
    "type_id": 1,
    "type_name": "host",
    "weight": 0,
    "alg": "straw",
    "hash": "rjenkins1",
    "items": [
    { "id": 0,
    "weight": 0,
    "pos": 0},
    { "id": 1,
    "weight": 0,
    "pos": 1},
    { "id": 2,
    "weight": 0,
    "pos": 2},
    { "id": 3,
    "weight": 0,
    "pos": 3}]},
    { "id": -3,
    "name": "demo2",
    "type_id": 1,
    "type_name": "host",
    "weight": 0,
    "alg": "straw",
    "hash": "rjenkins1",
    "items": [
    { "id": 4,
    "weight": 0,
    "pos": 0},
    { "id": 5,
    "weight": 0,
    "pos": 1},
    { "id": 6,
    "weight": 0,
    "pos": 2},
    { "id": 7,
    "weight": 0,
    "pos": 3}]},
    { "id": -4,
    "name": "demo3",
    "type_id": 1,
    "type_name": "host",
    "weight": 0,
    "alg": "straw",
    "hash": "rjenkins1",
    "items": [
    { "id": 8,
    "weight": 0,
    "pos": 0},
    { "id": 9,
    "weight": 0,
    "pos": 1},
    { "id": 10,
    "weight": 0,
    "pos": 2},
    { "id": 11,
    "weight": 0,
    "pos": 3}]}],
    "rules": [
    { "rule_id": 0,
    "rule_name": "replicated_ruleset",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
    { "op": "take",
    "item": -1,
    "item_name": "default"},
    { "op": "chooseleaf_firstn",
    "num": 0,
    "type": "host"},
    { "op": "emit"}]}],
    "tunables": { "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "profile": "bobtail",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1}}
     
  19. Konstantinos Pappas

    udo, for your information I still have the same problem: when node1 is down everything freezes;
    when I shut down node2 or node3, everything is all right.

    pfffffffffff
     
  20. symmcom

    Hi Konstantinos,
    it was not necessary to do a fresh installation. You could just have deleted the storage from the Proxmox GUI and reattached it with the proper pool name.


    From your crushmap, I do not understand why your tunables profile says bobtail. If you followed the Proxmox Ceph installation wiki (http://pve.proxmox.com/wiki/Ceph_Server) you should have at least dumpling, not bobtail. I also don't see why all your OSD weights are "0". Unless I missed something, nowhere in your crushmap is there any weight for any OSD.
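    A quick way to see all OSD weights at once, without reading the raw crushmap, would be (just a sketch):
    Code:
    ceph osd tree
    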

    Are you doing this in a virtual environment, or are the nodes actual physical machines?

    We fully understand your issue and what the problem is, so please don't keep repeating it.
     