Ceph OSD failure causing Proxmox node to crash

wahmed

Over the weekend, 2 HDDs (OSDs) died in one of our Ceph clusters. After I replaced them, the cluster went into rebalancing mode as usual. But since then I cannot access the RBD storage from Proxmox. It is also making some of the Proxmox nodes inaccessible through the GUI. If I log in to the GUI on a working node, some of the other nodes keep giving connection errors. Syslog shows all nodes logging the following message:
Code:
Jan 19 04:18:10 00-01-01-21 pveproxy[363619]: WARNING: proxy detected vanished client connection
The nodes that are inaccessible have a higher number of these messages. No VM can be started. The nodes cannot even access an NFS share without connection errors. As soon as I disable the RBD storage, the Proxmox cluster becomes normal again: all the connection errors go away, and I can access the NFS share and all the other shares. But of course I then no longer have RBD, and thus no VMs.
Any idea?
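(Side note: "disabling the RBD storage" can be done from the Datacenter -> Storage GUI or on the command line. A rough sketch, assuming the storage entry is named "ceph-rbd" in storage.cfg, which is a made-up ID here:)
Code:
# mark the RBD storage entry as disabled so Proxmox stops polling it
pvesm set ceph-rbd --disable 1
# re-enable it later by setting disable back to 0 (or removing the line from storage.cfg)
pvesm set ceph-rbd --disable 0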
 

Hi Wasim,
what is the output of
Code:
ceph -s
ceph pg dump_stuck inactive
ceph pg dump_stuck stale
ceph pg dump_stuck unclean
And do you have the following lines in the [osd] section of your ceph.conf?
Code:
osd max backfills = 1
osd recovery max active = 1
Udo
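(Side note: these throttles can also be pushed into already-running OSDs without a restart; a quick sketch using the values Udo suggests:)
Code:
# inject the recovery throttles into all running OSDs (takes effect immediately)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'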
 

I have had this for the last year. Over the weekend I changed it to much higher values, since I could not access the RBD and those 2 OSDs left the cluster with multiple "Requests Blocked > 2 secs; 2 slow OSDs" messages. I thought more than 2 HDDs had failed, so I replaced those as well. But it seems like every time I down/out an OSD, another one takes its place with a slow OSD message. After over 48 hours of trying many things, I have now decided to reduce the OSDs to a bare minimum in the hope that those slow OSD messages will go away. Short version: the cluster is going through a whole lot of rebalancing right now, so there is too much in flux to show meaningful dump_stuck output.

But there are 3 PGs that have been stuck inactive since Friday, when the first HDD died. The hope of repairing those is what triggered this whole ordeal.
# ceph pg dump_stuck inactive
Code:
pg_stat objects mip     degr    unf     bytes   log     disklog state   state_stamp     v       reported        up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
9.2a    0       0       0       0       0       0       0       incomplete      2015-01-19 04:49:51.026809      0'0     77987:2212      [22,20,2]      22       [22,20,2]       22      62856'127824    2015-01-16 03:26:50.913162      61530'107402    2015-01-09 20:03:18.294603
9.16f   0       0       0       0       0       0       0       incomplete      2015-01-19 04:49:51.032679      0'0     77987:1513      [19,6,5]       19       [19,6,5]        19      63054'96345     2015-01-16 07:49:27.214299      61530'70570     2015-01-10 04:46:12.217642
 
I have had this for the last year. Over the weekend I changed it to much higher values, since I could not access the RBD and those 2 OSDs left the cluster with multiple "Requests Blocked > 2 secs; 2 slow OSDs" messages. I thought more than 2 HDDs had failed, so I replaced those as well.
Hi,
do I understand you right that you (or Ceph, plus the ones you manually disabled) stopped more than two OSDs at a time? In that case (with a replica count of 3) you must have lost data!
But it seems like every time I down/out an OSD, another one takes its place with a slow OSD message. After over 48 hours of trying many things, I have now decided to reduce the OSDs to a bare minimum in the hope that those slow OSD messages will go away. Short version: the cluster is going through a whole lot of rebalancing right now, so there is too much in flux to show meaningful dump_stuck output.
Hmm, slow OSD messages come, AFAIK, from hardware limitations (network speed, CPU power, RAM, too much IO from recovery). Reducing the OSDs should only help if the issue is CPU power/RAM...

Udo
 
Do I understand you right that you (or Ceph, plus the ones you manually disabled) stopped more than two OSDs at a time? In that case (with a replica count of 3) you must have lost data!
No. I replaced one at a time: replaced one, waited for the rebalancing, then did the next one.
Would blocked requests prevent reads, and thus prevent Proxmox from accessing the storage?
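(For context, the one-at-a-time replacement described above usually looks roughly like this; osd.12 is just a placeholder ID:)
Code:
ceph osd out osd.12              # stop placing data on the failed disk
# ...wait until "ceph -s" shows the rebalance has finished...
service ceph stop osd.12         # stop the daemon on its host, if it is still running
ceph osd crush remove osd.12     # remove it from the CRUSH map
ceph auth del osd.12             # remove its cephx key
ceph osd rm osd.12               # remove the OSD id itself
# then create the OSD on the replacement disk, e.g. with "pveceph createosd /dev/sdX"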
 
# ceph -s as of now
Code:
 health HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck unclean; 7 requests are blocked > 32 sec
     monmap e33: 5 mons at {1=10.0.100.5:6789/0,2=10.0.100.13:6789/0,3=10.0.100.6:6789/0,4=10.0.100.17:6789/0,6=10.0.100.7:6789/0}, election epoch 1830, quorum 0,1,2,3,4 1,3,6,2,4
     mdsmap e120: 0/0/1 up
     osdmap e78282: 19 osds: 19 up, 19 in
      pgmap v14161208: 1152 pgs, 3 pools, 3185 GB data, 807 kobjects
            9688 GB used, 25597 GB / 35285 GB avail
                   1 down+incomplete
                1141 active+clean
                   2 incomplete
                   8 active+clean+scrubbing+deep
 

Hi,
the problem is the down+incomplete.
Is the output current?
Code:
9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
and what does your
Code:
ceph osd tree
look like, especially for OSDs 1, 10, and 11?

Is osd.1 dead? If not really, perhaps you can bring osd.1 up to solve the down+incomplete, and if that works you can reweight osd.1 to 0 to move all its data to other disks.

Or you could try a repair?

Udo
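(For reference, the reweight and repair Udo mentions would be roughly the following, using PG 9.11f and osd.1 from the dump above:)
Code:
# drain osd.1 by setting its reweight to 0; its data moves to the other OSDs
ceph osd reweight 1 0
# ask the primary OSD to repair the stuck placement group
ceph pg repair 9.11f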
 
Hi,
the problem is the down+incomplete.
Is the output current?
Code:
9.11f   0       0       0       0       0       4262    4262    down+incomplete 2015-01-19 04:49:51.030072      65870'247873    77987:1237      [1,10,11]       1       [1,10,11]       1       63054'241792    2015-01-16 07:29:48.680838      61530'214476    2015-01-10 04:22:49.251984
The output is current.

Is osd.1 dead? If not really, perhaps you can bring osd.1 up to solve the down+incomplete, and if that works you can reweight osd.1 to 0 to move all its data to other disks.

# ceph osd tree
Code:
-9      9.05            host CA-00-01-01-17
0       1.81                    osd.0   up      1
2       1.81                    osd.2   up      1
4       1.81                    osd.4   up      1
9       1.81                    osd.9   up      1
12      1.81                    osd.12  up      1
-10     7.24            host CA-00-01-01-18
13      1.81                    osd.13  up      1
10      1.81                    osd.10  up      1
19      1.81                    osd.19  up      1
20      1.81                    osd.20  up      1
-11     9.05            host CA-00-01-01-19
6       1.81                    osd.6   up      1
7       1.81                    osd.7   up      1
8       1.81                    osd.8   up      1
11      1.81                    osd.11  up      1
14      1.81                    osd.14  up      1
-12     0               host CA-00-01-01-20
-13     9.05            host CA-00-01-01-21
1       1.81                    osd.1   up      1
21      1.81                    osd.21  up      1
22      1.81                    osd.22  up      1
23      1.81                    osd.23  up      1
5       1.81                    osd.5   up      1

Repairing the PGs has no effect. Is there a way to remove or reconstruct these PGs? At this point I am willing to take some data loss for whatever is attached to these 3 PGs; I just need the cluster accessible.
I also tried to export VM images from the RBD pool. The connection to the RBD storage drops after 1-2% and just sits there. For a couple of minutes the cluster shows good reads, then goes quiet.
 
Hi,
Strange... all OSDs are up.

I had a similar issue, where Ceph stopped rebuilding after backfill_toofull. Stopping and starting Ceph on one OSD node brought everything back.
So I would try to stop Ceph on CA-00-01-01-21 (as a start), start it there again, and see if anything rebuilds/changes. After that, try the next node.

Does ceph -s show the same pgmap/monmap on all mons?

Udo
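(For reference, on the sysvinit-based Ceph packages of that era, stopping and starting all OSDs on one node would look roughly like this; the exact init integration may differ:)
Code:
# on CA-00-01-01-21: stop and start every local OSD daemon
service ceph stop osd
service ceph start osd
# or restart a single daemon
service ceph restart osd.1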
 
I rebooted each Ceph node one by one, with no effect. Repair, force-recreate, and scrub of the PGs also didn't do anything. Is there any way I can remove these 3 PGs forcefully?
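(For reference, the per-PG commands referred to here are roughly the following, using 9.11f as the example PG from earlier; force_create_pg existed in the Ceph releases of that era:)
Code:
ceph pg repair 9.11f            # ask the primary OSD to repair the PG
ceph pg scrub 9.11f             # light scrub
ceph pg deep-scrub 9.11f        # deep scrub
ceph pg force_create_pg 9.11f   # recreate the PG empty; any data in it is lost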

I am trying to export the VM images off the RBD storage, but the download hangs after 1%. As soon as I start copying an image I can see the Ceph read rate jump to 20-25 MB/s; then after a minute or two it stops and the read rate goes back to 0 bytes. If I can extract some of the VM images, I will forget about this Ceph cluster and recreate it from scratch.
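(A manual export with the rbd tool, bypassing the Proxmox GUI, would look something like this; the image name and target path are made up:)
Code:
# stream the image out of the pool into a local raw file
rbd export rbd/vm-101-disk-1 /mnt/backup/vm-101-disk-1.raw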

From the Proxmox GUI I can see the RBD storage stats, such as total size, graphs, etc., under Summary. When I click on Content, it shows a connection error after a few seconds.
 
Hi Wasim,
Find the corrupted VM disks.

If you remove the files, the placement group should be healthy again.

To find the VM files:
Code:
find /var/lib/ceph/osd/ceph-1/current/9.11f_head/ | grep "\\udata\." | awk -F . '{print $3}' | sort -u
This gives a list of all the prefixes used in the PG.

A sample of how to put this together:
Code:
find /var/lib/ceph/osd/ceph-1/current/2.6a_head/ | grep "\\udata\." | awk -F . '{print $3}' | sort -u
1506c62ae8944a
257a22ae8944a
3e4efa2ae8944a
5819a74b0dc51

## pool is in this case rbd
rbd -p rbd ls 
vm-245-disk-1
vm-245-disk-2
vm-499-disk-1
vm-499-disk-2

## you see - it's ugly: parts of all disks are in this one PG
rbd info rbd/vm-245-disk-1
rbd image 'vm-245-disk-1':
        size 3072 GB in 786432 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.257a22ae8944a
        format: 2
        features: layering
The block_name_prefix rbd_data.257a22ae8944a matches the output of the find.
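(The find and rbd info steps above can be glued together to map every prefix in the PG directory back to an image name; a small sketch, assuming pool rbd and PG 9.11f:)
Code:
for prefix in $(find /var/lib/ceph/osd/ceph-1/current/9.11f_head/ | grep "\\udata\." | awk -F . '{print $3}' | sort -u); do
    for img in $(rbd -p rbd ls); do
        # print the image whose block_name_prefix contains this prefix
        rbd info rbd/$img | grep -q "$prefix" && echo "$prefix -> $img"
    done
done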

BTW, you can try to set the reweight of osd.1 to 0; after the rebalance, only PG 9.11f should be left on that OSD. After that you can tell Ceph that the OSD is lost... I don't know if this helps, because repair didn't work for me either when I had trouble with one defective PG (though that error was gone after deleting the benchmark data (the test pool) that the problematic PG data came from).

Udo
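(For reference, the reweight-then-lost sequence described above would be roughly the following; "ceph osd lost" is destructive and only a last resort:)
Code:
ceph osd reweight 1 0                    # drain osd.1 first
# after the rebalance, declare the OSD lost so the PG can give up on it
ceph osd lost 1 --yes-i-really-mean-it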
 
I found the VM image that might be contributing to this issue, but I cannot seem to delete it. Every time I start the deletion it gets to roughly 12%, then hangs.
 
Hi,
have you tried to delete the 4 MB chunks via rados?

Like this, for a VM disk with prefix 44d4262ae8944a:
Code:
rados rm -p rbd rbd_data.44d4262ae8944a.0000000000000466
...
Udo
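(To remove every chunk of that prefix rather than one object at a time, the rados listing can be piped into rm; a sketch, using the prefix from the example above:)
Code:
# list all objects belonging to the image prefix and remove them one by one
rados -p rbd ls | grep '^rbd_data\.44d4262ae8944a\.' | while read obj; do
    rados -p rbd rm "$obj"
done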
 
No go Udo. :(
I cannot even make Ceph show me a list of all the objects for the VM image. It gets stuck halfway through scrolling the header prefixes.

An executive decision has been made: we are leaving this Ceph pool behind. We have already restored all VMs from recent backups and moved 3 good images off the Ceph cluster. But I would not say it was a total loss at all. The amount of insight and the commands I documented will go a very long way toward tackling similar issues in the future. I was not even aware of half of the commands I had to punch in to try to fix this PG issue.
Luckily it happened on one of the less "important" production clusters. It could have been worse. :)
Thank you, Udo, for all the help. It was extremely valuable.

I created several pools on this damaged cluster and connected them to Proxmox. I can access those pools with absolutely no issue: store VM images, delete them, etc. So the problem is with this particular Ceph pool itself; we are just going to delete it and move on. I think from now on we will take full advantage of pools in Ceph and create separate pools for VMs with different natures/requirements.

Somewhere in this thread, Udo, you pointed out that taking out 2 HDDs at the same time with replica 3 was a no-no. What about an incident where 2 HDDs fail simultaneously? Wouldn't it heal itself? Or say an entire node fails and it takes several days to get it up again. Wouldn't the data be safe without causing this sort of PG issue?

On the new pools I have set replica size 3 with a minimum size of 2. What do you think?
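(For reference, size and min_size can be checked and changed per pool; a sketch, assuming the pool is called rbd:)
Code:
ceph osd pool get rbd size        # number of replicas kept
ceph osd pool get rbd min_size    # replicas that must be available before IO is served
ceph osd pool set rbd min_size 2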
 
No go Udo. :(
I cannot even make Ceph show me a list of all the objects for the VM image. It gets stuck halfway through scrolling the header prefixes.
what a shame.
I think from now on we will take full advantage of pools in Ceph and create separate pools for VMs with different natures/requirements.
Perhaps not a bad idea for me either, to use more than one production pool...
Somewhere in this thread, Udo, you pointed out that taking out 2 HDDs at the same time with replica 3 was a no-no.
Hi,
that was a misunderstanding. I wrote "more than two OSDs".
What about an incident where 2 HDDs fail simultaneously? Wouldn't it heal itself? Or say an entire node fails and it takes several days to get it up again. Wouldn't the data be safe without causing this sort of PG issue?
In theory this can't happen... but as you see, it happens. With replica 3, two OSDs can die and the cluster can rebuild.
Perhaps you should not erase your pool too fast. Try to find help on the ceph-users mailing list (OK, often there is not much response on that list).
On the new pools I have set replica size 3 with a minimum size of 2. What do you think?
Hmm,
for safety reasons perhaps not bad, but I'm not sure your issue would have been impossible with this setting. The cluster will block writes after the second disk outage... i.e. your PVE cluster would hang too. If after the rebuild everything works like before, it's perhaps OK...

I think I would stay with minimum size 1.

Udo
 
I think I would stay with minimum size 1.
Perhaps I have the wrong idea about the minimum size. What is it exactly?
What I understand as of now is that it is the minimum number of replicas the cluster requires before it will acknowledge a write operation. So with a min size of 2, the cluster ensures that there are always 2 replicas. With a min size of 1, if a single copy dies somehow, as in my case, there are no more replicas left to rebuild from.

I am thinking of a couple of other clusters we are managing which are destined to grow up to a petabyte. With so many OSDs and nodes, a min size greater than 1 makes sense. They are sitting on Ubuntu but the plan is to move them to Proxmox.
 
Hello Udo and Wasim,
Wasim, if I understand correctly, you have 3 nodes with 4 OSDs per node?
Is it necessary to use replica 1 rather than 2?
If two OSDs go out at the same time, does that mean you lose data?
As I remember from the Ceph documentation, it is necessary to delete these disks manually, replace them with brand-new disks, and put them back into production.

By the way, is there something else that I must know?
I ask because I am preparing to convert my entire data center to Ceph storage.
Thanks to both of you.
 
