Ceph OSD failure causing Proxmox node to crash

wahmed

Over the weekend, 2 HDDs (OSDs) died in one of our Ceph clusters. After I replaced them, the cluster went into rebalancing mode as usual. But since then I cannot access the RBD storage from Proxmox. It is also making some of the Proxmox nodes inaccessible through the GUI. If I log in to the GUI on a working node, some of the other nodes keep giving connection errors. Syslog shows all nodes logging the following message:
Code:
Jan 19 04:18:10 00-01-01-21 pveproxy[363619]: WARNING: proxy detected vanished client connection
The nodes that are inaccessible have a higher number of these messages. No VM can be started. The nodes cannot even access an NFS share without connection errors. As soon as I disable the RBD storage, the Proxmox cluster becomes normal again: all the connection errors go away, and I can access the NFS share and all the other shares. But of course I then no longer have RBD, and thus no VMs.
Any idea?
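(Side note: "disabling the RBD storage" can be done from the Datacenter -> Storage GUI or on the command line. A rough sketch, assuming the storage entry is named "ceph-rbd" in storage.cfg, which is a made-up ID here:)
Code:
# mark the RBD storage entry as disabled so Proxmox stops polling it
pvesm set ceph-rbd --disable 1
# re-enable it later by setting disable back to 0 (or removing the line from storage.cfg)
pvesm set ceph-rbd --disable 0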
 

Hi Wasim,
what is the output of
Code:
ceph -s
ceph pg dump_stuck inactive
ceph pg dump_stuck stale
ceph pg dump_stuck unclean
And do you have the following lines in the [osd] section of your ceph.conf?
Code:
osd max backfills = 1
osd recovery max active = 1
Udo
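(Side note: these throttles can also be pushed into already-running OSDs without a restart; a quick sketch using the values Udo suggests:)
Code:
# inject the recovery throttles into all running OSDs (takes effect immediately)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'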
 

I have had this for the last year. Over the weekend I changed it to much higher values, since I could not access the RBD and those 2 OSDs left the cluster with multiple "Requests Blocked > 2 secs; 2 slow OSDs" messages. I thought more than 2 HDDs had failed, so I replaced those as well. But it seems like every time I down/out an OSD, another one takes its place with a slow OSD message. After over 48 hours of trying many things, I have now decided to reduce the OSDs to a bare minimum in the hope that those slow OSD messages will go away. Short version: the cluster is going through a whole lot of rebalancing right now, so there is too much in flux to show meaningful dump_stuck output.

But there are 3 PGs that have been stuck inactive since Friday, when the first HDD died. The hope of repairing those is what triggered this whole ordeal.
# ceph pg dump_stuck inactive
Code:
pg_stat objects mip     degr    unf     bytes   log     disklog state   state_stamp     v       reported        up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
9.2a    0       0       0       0       0       0       0       incomplete      2015-01-19 04:49:51.026809      0'0     77987:2212      [22,20,2]      22       [22,20,2]       22      62856'127824    2015-01-16 03:26:50.913162      61530'107402    2015-01-09 20:03:18.294603
9.16f   0       0       0       0       0       0       0       incomplete      2015-01-19 04:49:51.032679      0'0     77987:1513      [19,6,5]       19       [19,6,5]        19      63054'96345     2015-01-16 07:49:27.214299      61530'70570     2015-01-10 04:46:12.217642
 
I have had this for the last year. Over the weekend I changed it to much higher values, since I could not access the RBD and those 2 OSDs left the cluster with multiple "Requests Blocked > 2 secs; 2 slow OSDs" messages. I thought more than 2 HDDs had failed, so I replaced those as well.
Hi,
do I understand you right that you (or Ceph, plus the ones you manually disabled) stopped more than two OSDs at a time? In that case (with a replica count of 3) you must have lost data!
But it seems like every time I down/out an OSD, another one takes its place with a slow OSD message. After over 48 hours of trying many things, I have now decided to reduce the OSDs to a bare minimum in the hope that those slow OSD messages will go away. Short version: the cluster is going through a whole lot of rebalancing right now, so there is too much in flux to show meaningful dump_stuck output.
Hmm, slow OSD messages come, AFAIK, from hardware limitations (network speed, CPU power, RAM, too much IO from recovery). Reducing the OSDs should only help if the issue is CPU power/RAM...

Udo
 
Do I understand you right that you (or Ceph, plus the ones you manually disabled) stopped more than two OSDs at a time? In that case (with a replica count of 3) you must have lost data!
No. I replaced one at a time: replaced one, waited for the rebalancing, then did the next one.
Would blocked requests prevent reads, and thus prevent Proxmox from accessing the storage?
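(For context, the one-at-a-time replacement described above usually looks roughly like this; osd.12 is just a placeholder ID:)
Code:
ceph osd out osd.12              # stop placing data on the failed disk
# ...wait until "ceph -s" shows the rebalance has finished...
service ceph stop osd.12         # stop the daemon on its host, if it is still running
ceph osd crush remove osd.12     # remove it from the CRUSH map
ceph auth del osd.12             # remove its cephx key
ceph osd rm osd.12               # remove the OSD id itself
# then create the OSD on the replacement disk, e.g. with "pveceph createosd /dev/sdX"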
 
# ceph -s as of now
Code:
 health HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck unclean; 7 requests are blocked > 32 sec
     monmap e33: 5 mons at {1=10.0.100.5:6789/0,2=10.0.100.13:6789/0,3=10.0.100.6:6789/0,4=10.0.100.17:6789/0,6=10.0.100.7:6789/0}, election epoch 1830, quorum 0,1,2,3,4 1,3,6,2,4
     mdsmap e120: 0/0/1 up
     osdmap e78282: 19 osds: 19 up, 19 in
      pgmap v14161208: 1152 pgs, 3 pools, 3185 GB data, 807 kobjects
            9688 GB used, 25597 GB / 35285 GB avail
                   1 down+incomplete
                1141 active+clean
                   2 incomplete
                   8 active+clean+scrubbing+deep
 

Hi,
the problem is the down+incomplete.
Is the output current?
Code:
9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
and what does your
Code:
ceph osd tree
look like, especially for OSDs 1, 10, and 11?

Is osd.1 dead? If not really, perhaps you can bring osd.1 up to solve the down+incomplete, and if that works you can reweight osd.1 to 0 to move all its data to other disks.

Or you could try a repair?

Udo
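(For reference, the reweight and repair Udo mentions would be roughly the following, using PG 9.11f and osd.1 from the dump above:)
Code:
# drain osd.1 by setting its reweight to 0; its data moves to the other OSDs
ceph osd reweight 1 0
# ask the primary OSD to repair the stuck placement group
ceph pg repair 9.11f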
 
Hi,
the problem is the down+incomplete.
Is the output current?
Code:
9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
The output is current.

Is osd.1 dead? If not really, perhaps you can bring osd.1 up to solve the down+incomplete, and if that works you can reweight osd.1 to 0 to move all its data to other disks.

# ceph osd tree
Code:
-9      9.05            host CA-00-01-01-17
0       1.81                    osd.0   up      1
2       1.81                    osd.2   up      1
4       1.81                    osd.4   up      1
9       1.81                    osd.9   up      1
12      1.81                    osd.12  up      1
-10     7.24            host CA-00-01-01-18
13      1.81                    osd.13  up      1
10      1.81                    osd.10  up      1
19      1.81                    osd.19  up      1
20      1.81                    osd.20  up      1
-11     9.05            host CA-00-01-01-19
6       1.81                    osd.6   up      1
7       1.81                    osd.7   up      1
8       1.81                    osd.8   up      1
11      1.81                    osd.11  up      1
14      1.81                    osd.14  up      1
-12     0               host CA-00-01-01-20
-13     9.05            host CA-00-01-01-21
1       1.81                    osd.1   up      1
21      1.81                    osd.21  up      1
22      1.81                    osd.22  up      1
23      1.81                    osd.23  up      1
5       1.81                    osd.5   up      1

Repairing the PGs has no effect. Is there a way to remove or reconstruct these PGs? At this point I am willing to take some data loss for whatever is attached to these 3 PGs; I just need the cluster accessible.
I also tried to export VM images from the RBD pool. The connection to the RBD storage drops after 1-2% and just sits there. For a couple of minutes the cluster shows good reads, then goes quiet.
 
Hi,
Strange... all OSDs are up.

I had a similar issue, where Ceph stopped rebuilding after backfill_toofull. Stopping and starting Ceph on one OSD node brought everything back.
So I would try to stop Ceph on CA-00-01-01-21 (as a start), start it there again, and see if anything rebuilds/changes. After that, try the next node.

Does ceph -s show the same pgmap/monmap on all mons?

Udo
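(For reference, on the sysvinit-based Ceph packages of that era, stopping and starting all OSDs on one node would look roughly like this; the exact init integration may differ:)
Code:
# on CA-00-01-01-21: stop and start every local OSD daemon
service ceph stop osd
service ceph start osd
# or restart a single daemon
service ceph restart osd.1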
 
I rebooted each Ceph node one by one, with no effect. Repair, force-recreate, and scrub of the PGs also didn't do anything. Is there any way I can remove these 3 PGs forcefully?
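(For reference, the per-PG commands referred to here are roughly the following, using 9.11f as the example PG from earlier; force_create_pg existed in the Ceph releases of that era:)
Code:
ceph pg repair 9.11f            # ask the primary OSD to repair the PG
ceph pg scrub 9.11f             # light scrub
ceph pg deep-scrub 9.11f        # deep scrub
ceph pg force_create_pg 9.11f   # recreate the PG empty; any data in it is lost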

I am trying to export the VM images off the RBD storage, but the download hangs after 1%. As soon as I start copying an image I can see the Ceph read rate jump to 20-25 MB/s; then after a minute or two it stops and the read rate goes back to 0 bytes. If I can extract some of the VM images, I will forget about this Ceph cluster and recreate it from scratch.
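(A manual export with the rbd tool, bypassing the Proxmox GUI, would look something like this; the image name and target path are made up:)
Code:
# stream the image out of the pool into a local raw file
rbd export rbd/vm-101-disk-1 /mnt/backup/vm-101-disk-1.raw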

From the Proxmox GUI I can see the RBD storage stats, such as total size, graphs, etc., under Summary. When I click on Content, it shows a connection error after a few seconds.
 
Hi Wasim,
Find the corrupted VM disks.

If you remove the files, the placement group should be healthy again.

To find the VM files:
Code:
find /var/lib/ceph/osd/ceph-1/current/9.11f_head/ | grep "\\udata\." | awk -F . '{print $3}' | sort -u
This gives a list of all the prefixes used in the PG.

A sample of how to put this together:
Code:
find /var/lib/ceph/osd/ceph-1/current/2.6a_head/ | grep "\\udata\." | awk -F . '{print $3}' | sort -u
1506c62ae8944a
257a22ae8944a
3e4efa2ae8944a
5819a74b0dc51

## pool is in this case rbd
rbd -p rbd ls 
vm-245-disk-1
vm-245-disk-2
vm-499-disk-1
vm-499-disk-2

## you see - it's ugly: parts of all disks are in this one PG
rbd info rbd/vm-245-disk-1
rbd image 'vm-245-disk-1':
        size 3072 GB in 786432 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.257a22ae8944a
        format: 2
        features: layering
The block_name_prefix rbd_data.257a22ae8944a matches the output of the find.
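(The find and rbd info steps above can be glued together to map every prefix in the PG directory back to an image name; a small sketch, assuming pool rbd and PG 9.11f:)
Code:
for prefix in $(find /var/lib/ceph/osd/ceph-1/current/9.11f_head/ | grep "\\udata\." | awk -F . '{print $3}' | sort -u); do
    for img in $(rbd -p rbd ls); do
        # print the image whose block_name_prefix contains this prefix
        rbd info rbd/$img | grep -q "$prefix" && echo "$prefix -> $img"
    done
done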

BTW, you can try to set the reweight of osd.1 to 0; after the rebalance, only PG 9.11f should be left on that OSD. After that you can tell Ceph that the OSD is lost... I don't know if this helps, because repair didn't work for me either when I had trouble with one defective PG (though that error was gone after deleting the benchmark data (the test pool) that the problematic PG data came from).

Udo
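(For reference, the reweight-then-lost sequence described above would be roughly the following; "ceph osd lost" is destructive and only a last resort:)
Code:
ceph osd reweight 1 0                    # drain osd.1 first
# after the rebalance, declare the OSD lost so the PG can give up on it
ceph osd lost 1 --yes-i-really-mean-it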
 
I found the VM image that might be contributing to this issue, but I cannot seem to delete it. Every time I start the deletion it gets to roughly 12%, then hangs.
 
Hi,
have you tried to delete the 4 MB chunks via rados?

Like this, for a VM disk with prefix 44d4262ae8944a:
Code:
rados rm -p rbd rbd_data.44d4262ae8944a.0000000000000466
...
Udo
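(To remove every chunk of that prefix rather than one object at a time, the rados listing can be piped into rm; a sketch, using the prefix from the example above:)
Code:
# list all objects belonging to the image prefix and remove them one by one
rados -p rbd ls | grep '^rbd_data\.44d4262ae8944a\.' | while read obj; do
    rados -p rbd rm "$obj"
done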
 
No go Udo. :(
I cannot even make Ceph show me a list of all the objects for the VM image. It gets stuck halfway through scrolling the header prefixes.

An executive decision has been made: we are leaving this Ceph pool behind. We have already restored all VMs from recent backups and moved 3 good images off the Ceph cluster. But I would not say it was a total loss at all. The amount of insight and the commands I documented will go a very long way toward tackling similar issues in the future. I was not even aware of half of the commands I had to punch in to try to fix this PG issue.
Luckily it happened on one of the less "important" production clusters. It could have been worse. :)
Thank you, Udo, for all the help. It was extremely valuable.

I created several pools on this damaged cluster and connected them to Proxmox. I can access those pools with absolutely no issue: store VM images, delete them, etc. So the problem is with this particular Ceph pool itself; we are just going to delete it and move on. I think from now on we will take full advantage of pools in Ceph and create separate pools for VMs with different natures/requirements.

Somewhere in this thread, Udo, you pointed out that taking out 2 HDDs at the same time with replica 3 was a no-no. What about an incident where 2 HDDs fail simultaneously? Wouldn't it heal itself? Or say an entire node fails and it takes several days to get it up again. Wouldn't the data be safe without causing this sort of PG issue?

On the new pools I have set replica size 3 with a minimum size of 2. What do you think?
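(For reference, size and min_size can be checked and changed per pool; a sketch, assuming the pool is called rbd:)
Code:
ceph osd pool get rbd size        # number of replicas kept
ceph osd pool get rbd min_size    # replicas that must be available before IO is served
ceph osd pool set rbd min_size 2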
 
No go Udo. :(
I cannot even make Ceph show me a list of all the objects for the VM image. It gets stuck halfway through scrolling the header prefixes.
what a shame.
I think from now on we will take full advantage of pools in Ceph and create separate pools for VMs with different natures/requirements.
Perhaps not a bad idea for me either, to use more than one production pool...
Somewhere in this thread, Udo, you pointed out that taking out 2 HDDs at the same time with replica 3 was a no-no.
Hi,
that was a misunderstanding. I wrote "more than two OSDs".
What about an incident where 2 HDDs fail simultaneously? Wouldn't it heal itself? Or say an entire node fails and it takes several days to get it up again. Wouldn't the data be safe without causing this sort of PG issue?
In theory this can't happen... but as you see, it happens. With replica 3, two OSDs can die and the cluster can rebuild.
Perhaps you should not erase your pool too fast. Try to find help on the ceph-users mailing list (OK, often there is not much response on that list).
On the new pools I have set replica size 3 with a minimum size of 2. What do you think?
Hmm,
for safety reasons perhaps not bad, but I'm not sure your issue would have been impossible with this setting. The cluster will block writes after the second disk outage... i.e. your PVE cluster would hang too. If after the rebuild everything works like before, it's perhaps OK...

I think I would stay with minimum size 1.

Udo
 
I think I would stay with minimum size 1.
Perhaps I have the wrong idea about the minimum size. What is it exactly?
What I understand as of now is that it is the minimum number of replicas the cluster requires before it will acknowledge a write operation. So with a min size of 2, the cluster ensures that there are always 2 replicas. With a min size of 1, if a single copy dies somehow, as in my case, there are no more replicas left to rebuild from.

I am thinking of a couple of other clusters we are managing which are destined to grow up to a petabyte. With so many OSDs and nodes, a min size greater than 1 makes sense. They are sitting on Ubuntu but the plan is to move them to Proxmox.
 
Hello Udo and Wasim,
Wasim, if I understand correctly, you have 3 nodes with 4 OSDs per node?
Is it necessary to use replica 1 rather than 2?
If two OSDs go out at the same time, does that mean you lose data?
As I remember from the Ceph documentation, it is necessary to delete these disks manually, replace them with brand-new disks, and put them back into production.

By the way, is there something else that I must know?
I ask because I am preparing to convert my entire data center to Ceph storage.
Thanks to both of you.
 
