Hello,
Sorry if this has been covered before, but I can't seem to find a solution.
I have a 5-node Proxmox (5.4) cluster running Ceph: 3 nodes with 3 OSDs each and 2 nodes with 2 OSDs each (to be increased shortly). When a server goes offline, or I shut down 2 OSDs, some of my VMs run into issues (nothing in the VM logs, but webpages stop loading, etc.). I have also tried setting noout beforehand. All OSDs are 1.7TB PM883s.
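For reference, this is roughly how I set noout before planned maintenance (standard ceph CLI, nothing cluster-specific):

# stop CRUSH from marking OSDs out and rebalancing during planned maintenance
ceph osd set noout
# ...shut down the node / stop the OSDs, do the work...
# re-enable normal out/recovery behaviour afterwards
ceph osd unset noout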
The VMs in question are running on the "ceph-ssd" pool (ceph-nvme is just for testing).
If anyone has any suggestions, it would be greatly appreciated. I want to upgrade to Proxmox 6 and Nautilus, but I don't feel confident doing so while this keeps happening.
Crushmap https://pastebin.com/raw/uG6PACxv
Logs https://pastebin.com/raw/mkDPVTtQ
Config https://pastebin.com/raw/P22t4phw
root@hv2:~# pveceph pool ls
Name       size min_size pg_num %-used          used
ceph-nvme     2        1    128  18.01   73815558239
ceph-ssd      3        2    256  55.27 3844619901849
root@hv2:~#
root@hv2:/var/log/ceph# ceph -s
  cluster:
    id:     40b9a33d-25c9-42b8-aa49-5a73c4bfa879
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum hv2,hv4,hv5
    mgr: hv2(active), standbys: hv5, hv4
    osd: 17 osds: 17 up, 17 in

  data:
    pools:   2 pools, 384 pgs
    objects: 939.60k objects, 3.57TiB
    usage:   10.6TiB used, 13.0TiB / 23.6TiB avail
    pgs:     384 active+clean

  io:
    client:   58.1MiB/s rd, 10.4MiB/s wr, 1.36kop/s rd, 238op/s wr
Is my pg_num too low? Should I set min_size to 1? Do I not have enough OSDs? Is it because the OSD counts per node are uneven?
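For reference, these are the commands I'd expect to use to check or change those settings (standard ceph CLI; 512 is just an example target for pg_num, not something I've settled on):

# check the current replication and PG settings on the VM pool
ceph osd pool get ceph-ssd size
ceph osd pool get ceph-ssd min_size
ceph osd pool get ceph-ssd pg_num
# if pg_num does turn out to be too low, raise it (and pgp_num to match), e.g.:
ceph osd pool set ceph-ssd pg_num 512
ceph osd pool set ceph-ssd pgp_num 512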
Thanks