Ceph OSD failure causing Proxmox node to crash

Discussion in 'Proxmox VE: Installation and configuration' started by symmcom, Jan 19, 2015.

  1. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
    Over the weekend 2 HDDs (OSDs) died in one of our Ceph clusters. After I replaced them, the cluster went into rebalancing mode as usual, but since then I cannot access the RBD storage from Proxmox. It is also making some of the Proxmox nodes inaccessible through the GUI; if I log in to the GUI on a working node, some of the nodes keep giving connection errors. Syslog shows that all nodes are getting the following message:
    Code:
    Jan 19 04:18:10 00-01-01-21 pveproxy[363619]: WARNING: proxy detected vanished client connection
    The nodes that are inaccessible have a higher number of these messages. No VM can be started, and the nodes cannot even access the NFS share without connection errors. As soon as I disable the RBD storage, the Proxmox cluster becomes normal again: all connection errors go away and I can access the NFS share and all other shares. But of course I then no longer have RBD, and thus no VMs.
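
    For reference, toggling the storage from the CLI can be done with something like the following (the storage ID "ceph-rbd" is just a placeholder for ours):
    Code:
    # disable the RBD storage definition so Proxmox stops querying it
    pvesm set ceph-rbd --disable 1
    # re-enable it later
    pvesm set ceph-rbd --disable 0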
    Any idea?
     
  2. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    Hi Wasim,
    what is the output of
    Code:
    ceph -s
    ceph pg dump_stuck inactive
    ceph pg dump_stuck stale
    ceph pg dump_stuck unclean
    
    And do you have the following lines in the osd section of your ceph.conf?
    Code:
    osd max backfills = 1
    osd recovery max active = 1
    
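    These values can also be applied to running OSDs without a restart, roughly like this:
    Code:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'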
    Udo
     
  3. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
    I have had those settings for the last year. Over the weekend I changed them to much higher values, since I could not access the RBD storage and those 2 OSDs left the cluster with multiple "requests are blocked > 32 sec" / slow-OSD messages. I thought more than 2 HDDs had failed, so I replaced those as well. But it seems like every time I down/out an OSD, another one takes its place in the slow-OSD messages. After over 48 hours of trying many things, I have now decided to reduce the OSDs to a bare minimum in the hope that those slow-OSD messages will go away. Short version: the cluster is going through a whole lot of rebalancing right now, so the dump_stuck output is still changing.
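
    (For reference, the usual down/out-and-remove sequence for a dead OSD is roughly the following; osd.12 is just an example id:)
    Code:
    # mark the OSD out so Ceph starts rebalancing its data
    ceph osd out 12
    # after stopping the OSD daemon on its node, remove it from the cluster
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12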

    But there are 3 PGs that have been stuck inactive since Friday, when the first HDD died. Trying to repair those is what triggered this whole ordeal.
    #ceph pg dump_stuck inactive
    Code:
    pg_stat objects mip     degr    unf     bytes   log     disklog state   state_stamp     v       reported        up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
    9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
    9.2a    0       0       0       0       0       0       0       incomplete      2015-01-19 04:49:51.026809      0'0     77987:2212      [22,20,2]      22       [22,20,2]       22      62856'127824    2015-01-16 03:26:50.913162      61530'107402    2015-01-09 20:03:18.294603
    9.16f   0       0       0       0       0       0       0       incomplete      2015-01-19 04:49:51.032679      0'0     77987:1513      [19,6,5]       19       [19,6,5]        19      63054'96345     2015-01-16 07:49:27.214299      61530'70570     2015-01-10 04:46:12.217642
    
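    (For per-PG detail, a query like the one below is one way to see what an incomplete PG is blocked on; 9.11f is taken from the list above:)
    Code:
    ceph pg 9.11f query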
     
  4. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    Hi,
    do I understand you right that you (or Ceph, plus the ones you manually disabled) stopped more than two OSDs at a time? In that case (with a replica count of 3) you must have lost data!
    Hmm, the slow-OSD messages come, AFAIK, from HW limitations (network speed, CPU power, RAM, too much IO due to recovery). Reducing the number of OSDs will only help if the issue is CPU power/RAM...
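
    To see which OSDs the blocked requests are sitting on, something like this usually helps:
    Code:
    # lists the OSDs with blocked/slow requests
    ceph health detail
    # per-OSD commit/apply latency, to spot a struggling disk
    ceph osd perf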

    Udo
     
  5. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
    No, I replaced one at a time: replaced one, waited for rebalancing, then did the next one.
    Would blocked requests prevent reads, and thus prevent Proxmox from accessing the storage?
     
  6. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    I think so
     
  7. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    But this does not explain the "down+incomplete" PG.
     
  8. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
    # ceph -s as of now
    Code:
     health HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck unclean; 7 requests are blocked > 32 sec
         monmap e33: 5 mons at {1=10.0.100.5:6789/0,2=10.0.100.13:6789/0,3=10.0.100.6:6789/0,4=10.0.100.17:6789/0,6=10.0.100.7:6789/0}, election epoch 1830, quorum 0,1,2,3,4 1,3,6,2,4
         mdsmap e120: 0/0/1 up
         osdmap e78282: 19 osds: 19 up, 19 in
          pgmap v14161208: 1152 pgs, 3 pools, 3185 GB data, 807 kobjects
                9688 GB used, 25597 GB / 35285 GB avail
                       1 down+incomplete
                    1141 active+clean
                       2 incomplete
                       8 active+clean+scrubbing+deep
    
     
  9. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    Hi,
    the problem is the down+incomplete.
    Is that output current?
    Code:
    9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
    
    and what does the output of
    Code:
    ceph osd tree
    
    look like, especially for osd.1, osd.10 and osd.11?

    Is osd.1 dead? If it is not really dead, perhaps you can bring osd.1 back up to resolve the down+incomplete, and if that works you can reweight osd.1 to 0 to move all of its data to other disks.

    Or do you want to try a repair?
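
    Roughly, those two options would look like this (osd.1 and PG 9.11f taken from your output):
    Code:
    # drain osd.1 once it is up again
    ceph osd reweight 1 0
    # or ask Ceph to repair the problematic PG
    ceph pg repair 9.11f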

    Udo
     
  10. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
     
  11. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
     
  12. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
     
    #12 symmcom, Jan 19, 2015
    Last edited: Jan 19, 2015
  13. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
     
    #13 udo, Jan 19, 2015
    Last edited: Jan 19, 2015
  14. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    Hi again,
    any info in the log during repair?
    Code:
    tail -f /var/log/ceph/ceph-osd.1.log
    
     
  15. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
    I found the VM image that might be contributing to this issue, but I cannot seem to delete it. Every time I start the deletion, it gets roughly up to 12% and then hangs.
     
  16. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    Hi,
    have you tried to delete the 4 MB chunks via rados?

    Like this, for a VM disk with prefix 44d4262ae8944a:
    Code:
    rados rm -p rbd rbd_data.44d4262ae8944a.0000000000000466
    ...
    
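    To get the list of chunk names for that prefix first, something like this is one way (pool name "rbd" assumed, as above):
    Code:
    rados -p rbd ls | grep '^rbd_data.44d4262ae8944a'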
    Udo
     
  17. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
    No go Udo. :(
    I cannot even make Ceph show me a list of all the objects for the VM image; it gets stuck halfway through listing the object prefixes.

    An executive decision has been made: we are leaving this Ceph pool behind. We have already restored all the VMs from recent backups and moved the 3 good images off the Ceph cluster. But I would not say it was a total loss at all. The amount of insight gained and the commands I documented will go a very long way toward tackling similar issues in the future. I was not even aware of half the commands I had to punch in while trying to fix this PG issue.
    Luckily it happened on one of our less "important" production clusters. It could have been worse. :)
    Thank you Udo for all the help. It was extremely valuable.

    I created several pools on this damaged cluster and connected them to Proxmox. I can access those pools with absolutely no issue: store VM images, delete them, etc. So the problem is with this particular Ceph pool itself. We are just going to delete it and move on. I think from now on we will take full advantage of pools in Ceph and create separate pools for VMs with different natures/requirements.
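
    For anyone following along, an RBD storage entry in /etc/pve/storage.cfg looks roughly like this (the storage ID "ceph-vm" and pool name "vm-pool" are examples; the monitor addresses are the ones from the ceph -s output above, and the keyring goes in /etc/pve/priv/ceph/<storage>.keyring):
    Code:
    rbd: ceph-vm
            monhost 10.0.100.5;10.0.100.6;10.0.100.7
            pool vm-pool
            username admin
            content images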

    Somewhere in this thread, Udo, you pointed out that taking out 2 HDDs at the same time with replica 3 was a no-no. What about an incident where 2 HDDs fail simultaneously? Wouldn't it heal itself? Or say an entire node fails and it takes several days to get it up again. Wouldn't the data be safe, without causing this sort of PG issue?

    On the new pools I have set replica 3 with a minimum size of 2. What do you think?
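
    (For reference, these are set per pool with something like the following; "vm-pool" is just an example name:)
    Code:
    ceph osd pool set vm-pool size 3
    ceph osd pool set vm-pool min_size 2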
     
  18. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    What a shame.
    Perhaps it is also not a bad idea for me to use more than one production pool...
    Hi,
    this was a misunderstanding: I wrote "more than two OSDs".
    In theory this can't happen... but as you see, it does happen. With replica 3, two OSDs can die and the cluster can still rebuild.
    Perhaps you should not erase your pool too fast. Try to find help on the ceph-users mailing list (OK, often there is not that much response on that list).
    Hmm,
    for safety reasons that is perhaps not bad, but I'm not sure your issue would have been impossible with this setting. The cluster will block writes after the second disk outage... i.e. your PVE cluster would hang too. If everything works like before after the rebuild, it's perhaps OK...

    I think I would stay with minimum size 1.

    Udo
     
  19. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,066
    Likes Received:
    24
    Perhaps I have the wrong idea about the minimum size. What is it exactly?
    What I understand as of now is that this is the minimum number of replicas at which the cluster will acknowledge a write operation, so with a min size of 2 the cluster ensures that there are always 2 replicas. With min size 1, if a single copy dies somehow, as in my case, there are no more replicas left to rebuild from.

    I am thinking of a couple of other clusters we manage which are destined to grow up to a petabyte. With so many OSDs and nodes, a min size greater than 1 makes sense. They are sitting on Ubuntu, but the plan is to move them to Proxmox.
     
    #19 symmcom, Jan 20, 2015
    Last edited: Jan 20, 2015
  20. Konstantinos Pappas

    Konstantinos Pappas New Member

    Joined:
    Jan 7, 2015
    Messages:
    27
    Likes Received:
    0
    Hello Udo and Wasim,
    Wasim, if I understand correctly, you have 3 nodes with 4 OSDs per node?
    Is it necessary to use replica 1 rather than 2?
    If two OSDs fail at the same time, does that mean you lose data?
    As far as I remember from the Ceph documentation, you have to manually remove those disks, replace them with brand-new disks, and put them back into production.

    By the way, is there anything else I must know?
    I ask because I am preparing to convert my whole data center to Ceph storage.
    Thanks to you both
     