CEPH testing issue with Proxmox 4.4

jeffwadsworth

Member
Jan 13, 2016
Has anyone tested a failure with CEPH 0.94.9 using 3 replicas under Proxmox 4.4? I have a test environment set up using 4 clustered Proxmox 4.4 nodes, each with 2 OSDs (8 total). Each OSD is 256 GB. 3 monitors. A pool named test is configured for 3/3 (size/min_size) with 256 PGs. 3 VMs are in use, each with a 100 GB RAW disk.
After CEPH is reported as healthy, I pull 2 OSDs on 2 random nodes. After a bit of repairing, 21 PGs remain undersized (not enough replicas exist). Just curious if anyone has seen similar results in testing.
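
For context, a pool like the one described would typically be created along these lines; the pool name test and the 256 PG count are taken from the description above, the rest is just the standard CLI of that era, so treat it as a sketch rather than the exact commands used:

# create a replicated pool with 256 placement groups (pg_num / pgp_num)
ceph osd pool create test 256 256 replicated
# match the 3/3 (size/min_size) configuration described above
ceph osd pool set test size 3
ceph osd pool set test min_size 3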

OSDs
         In   Out
Up        6     0
Down      0     2
Total: 8

PGs
active+clean:                 226
active+remapped:                9
undersized+degraded+peered:    21
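
The state above can be inspected with the standard ceph CLI; a rough sketch, assuming the cluster from this thread:

# overall cluster status and the reason each PG is degraded
ceph -s
ceph health detail
# list PGs stuck in non-clean states (undersized/degraded/remapped)
ceph pg dump_stuck unclean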
 
What does your crush map look like?

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host proxmox1 {
	id -2		# do not change unnecessarily
	# weight 0.440
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 0.220
	item osd.2 weight 0.220
}
host proxmox2 {
	id -3		# do not change unnecessarily
	# weight 0.440
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 0.220
	item osd.1 weight 0.220
}
host proxmox3 {
	id -4		# do not change unnecessarily
	# weight 0.440
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 0.220
	item osd.5 weight 0.220
}
host proxmox4 {
	id -5		# do not change unnecessarily
	# weight 0.440
	alg straw
	hash 0	# rjenkins1
	item osd.6 weight 0.220
	item osd.7 weight 0.220
}
root default {
	id -1		# do not change unnecessarily
	# weight 1.760
	alg straw
	hash 0	# rjenkins1
	item proxmox1 weight 0.440
	item proxmox2 weight 0.440
	item proxmox3 weight 0.440
	item proxmox4 weight 0.440
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map
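
For anyone wanting to pull the same information from their own cluster, a decompiled map like the one above comes from the usual tools; the file paths here are just examples:

# export the compiled CRUSH map and decompile it to the text form shown above
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt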
 
I had to take it offline for today but will post ASAP. Thanks for the help.
To answer the quick questions:
No, noout is not active
No, it never recovered
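
Side note: for planned maintenance, the usual way to keep Ceph from marking pulled OSDs out and rebalancing is the noout flag; a minimal sketch:

# before intentionally taking OSDs down
ceph osd set noout
# once the OSDs are back up and in
ceph osd unset noout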
 
OK, due to work matters I never got around to connecting everything back up, but I believe I know why this happened. My min_size is 3 and probably should have been 1. With fewer replicas available than min_size, those PGs could not serve I/O, which likely explains what I saw in this case.

http://docs.ceph.com/docs/master/rados/operations/pools/#set-the-number-of-object-replicas
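
Checking the current values on the test pool is straightforward; a quick sketch:

ceph osd pool get test size
ceph osd pool get test min_size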

min_size 1 is very dangerous - most setups run with min_size 2 and size >= 3

see e.g. this post (and the whole thread ;)): http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014872.html
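
Following that advice, adjusting the test pool to size 3 / min_size 2 would look roughly like this:

ceph osd pool set test size 3
ceph osd pool set test min_size 2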
 
