VM issues when stopping OSDs

Brad22

Active Member
Jun 12, 2019
Hello,


Sorry if this has been covered before but I can't seem to find a solution.

I have a 5-node Proxmox (5.4) cluster running Ceph: 3 nodes with 3 OSDs each and 2 nodes with 2 OSDs each (to be increased shortly). When a node goes offline, or when I shut down 2 OSDs, some of my VMs run into issues (nothing in the VM logs, but web pages stop loading, etc.). I have also tried setting noout beforehand. All OSDs are 1.7TB PM883s.
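For reference, what I run before taking the OSDs down looks roughly like this (on one of the monitor nodes), unsetting the flag again once everything is back up and in:

root@hv2:~# ceph osd set noout
(shut down / service the OSDs or the node)
root@hv2:~# ceph osd unset noout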

The VMs in question are running on the "ceph-ssd" pool (ceph-nvme is for testing)

If anyone has any suggestions it would be greatly appreciated. I want to upgrade to Proxmox 6 and Nautilus, but I don't feel confident doing so while this is happening.

Crushmap https://pastebin.com/raw/uG6PACxv
Logs https://pastebin.com/raw/mkDPVTtQ
Config https://pastebin.com/raw/P22t4phw

root@hv2:~# pveceph pool ls
Name       size  min_size  pg_num  %-used           used
ceph-nvme     2         1     128   18.01     73815558239
ceph-ssd      3         2     256   55.27   3844619901849
root@hv2:~#

root@hv2:/var/log/ceph# ceph -s
  cluster:
    id:     40b9a33d-25c9-42b8-aa49-5a73c4bfa879
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum hv2,hv4,hv5
    mgr: hv2(active), standbys: hv5, hv4
    osd: 17 osds: 17 up, 17 in

  data:
    pools:   2 pools, 384 pgs
    objects: 939.60k objects, 3.57TiB
    usage:   10.6TiB used, 13.0TiB / 23.6TiB avail
    pgs:     384 active+clean

  io:
    client: 58.1MiB/s rd, 10.4MiB/s wr, 1.36kop/s rd, 238op/s wr



Is my pg_num too low? Should I set min_size to 1? Do I not have enough OSDs? Is it because they're uneven?

Thanks
 
Is my pg_num too low? Should I set min_size to 1? Do I not have enough OSDs? Is it because they're uneven?
No.
step choose firstn 0 type osd
It is because every rule (besides the default one) distributes replicas at the OSD level instead of the host level, so multiple copies of a PG can land on the same node, and taking that node (or a couple of its OSDs) down takes those copies offline together.
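For a replicated pool you normally want the failure domain at host level. A rule along these lines should do it (the name and id are just placeholders, and I'm assuming your rules already select the device class in the 'step take' line):

rule ceph-ssd-host {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}

On Luminous you can also create such a rule directly and point the pool at it, for example:

ceph osd crush rule create-replicated ssd-host default host ssd
ceph osd pool set ceph-ssd crush_rule ssd-host

Be aware that switching the rule will cause data movement.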
 
Not sure how I missed that. Thank you very much.
So, now to the rest of your questions. :)

Is my pg_num too low? Should I set min_size to 1? Do I not have enough OSDs? Is it because they're uneven?
You need to re-check the PG count [0]; it should be roughly twice as much. min_size 1 is very dangerous, as in-flight writes may be lost if a further failure occurs while a PG is degraded. The OSD count per node should ideally be balanced, for a more even data distribution.

[0] https://ceph.io/pgcalc/
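For the ceph-ssd pool that would be roughly the following (assuming around 15 of the 17 OSDs are the SSDs backing it: 15 x 100 / 3 ≈ 500, next power of two 512):

ceph osd pool set ceph-ssd pg_num 512
ceph osd pool set ceph-ssd pgp_num 512

On Luminous both pg_num and pgp_num need to be raised, and expect some rebalancing while the new PGs are created.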
 
Thanks. It's currently set to 2/3 with a pg_num of 512. The existing crush map certainly explains what I was seeing.
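For anyone finding this thread later, the pool settings can be double-checked with something like:

root@hv2:~# ceph osd pool get ceph-ssd size
root@hv2:~# ceph osd pool get ceph-ssd min_size
root@hv2:~# ceph osd pool get ceph-ssd pg_num
root@hv2:~# ceph osd pool get ceph-ssd crush_rule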
 
