Proxmox + Ceph: VMs won't start + dreadful performance

justSem

New Member
May 10, 2024
I have been searching far and wide for this, but I can't seem to find a solution.
In one of our testing clusters we're experimenting with Ceph, but so far the journey has been bumpy at best.

The actual trigger for this post is that one of our VMs locked up and I can't get it to reboot.
qm just times out and I can't seem to get any useful logging out of the cluster or out of Ceph.

To outline our setup:
We're running 3 nodes, each with a Ryzen 7950X and 128 GB of RAM.
All nodes have both a 1 Gbit/s and a 10 Gbit/s NIC; the latter is used for Ceph and migration traffic. The 10 Gbit network has an MTU of 9000.
All storage is NVMe (PCIe 4.0). Each node contains 3x 2 TB NVMe drives (WD).

ceph health detail
Code:
HEALTH_WARN Reduced data availability: 1 pg inactive; 256 slow ops, oldest one blocked for 607418 sec, osd.8 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 1.0 is stuck inactive for 7d, current state unknown, last acting []
[WRN] SLOW_OPS: 256 slow ops, oldest one blocked for 607418 sec, osd.8 has slow ops

ceph -s
Code:
root@pve001:~# ceph -s
  cluster:
    id:     45c6e495-0fd6-48fe-8df9-90018537a237
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            256 slow ops, oldest one blocked for 607439 sec, osd.8 has slow ops
 
  services:
    mon: 3 daemons, quorum pve001,pve002,pve003 (age 7d)
    mgr: pve002(active, since 7d), standbys: pve001, pve003
    osd: 9 osds: 9 up (since 7d), 9 in (since 5w)
 
  data:
    pools:   2 pools, 129 pgs
    objects: 523.53k objects, 2.0 TiB
    usage:   4.0 TiB used, 12 TiB / 16 TiB avail
    pgs:     0.775% pgs unknown
             128 active+clean
             1   unknown

ceph df
Code:
root@pve001:~# ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
nvme   16 TiB  12 TiB  4.0 TiB   4.0 TiB      24.21
TOTAL  16 TiB  12 TiB  4.0 TiB   4.0 TiB      24.21
 
--- POOLS ---
POOL  ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr   1    1      0 B        0      0 B      0    5.3 TiB
ceph   4  128  2.0 TiB  523.53k  3.9 TiB  26.96    5.3 TiB

ceph osd df
Code:
root@pve001:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 6   nvme  1.81940   1.00000  1.8 TiB  397 GiB  395 GiB   21 KiB  2.2 GiB  1.4 TiB  21.32  0.88   25      up
 7   nvme  1.81940   1.00000  1.8 TiB  522 GiB  521 GiB   21 KiB  1.7 GiB  1.3 TiB  28.04  1.16   33      up
 8   nvme  1.81940   1.00000  1.8 TiB  430 GiB  428 GiB   28 KiB  1.6 GiB  1.4 TiB  23.07  0.95   27      up
 3   nvme  1.81940   1.00000  1.8 TiB  475 GiB  473 GiB   22 KiB  1.4 GiB  1.4 TiB  25.48  1.05   30      up
 4   nvme  1.81940   1.00000  1.8 TiB  444 GiB  442 GiB   27 KiB  1.7 GiB  1.4 TiB  23.84  0.98   28      up
 5   nvme  1.81940   1.00000  1.8 TiB  461 GiB  459 GiB   22 KiB  1.7 GiB  1.4 TiB  24.72  1.02   29      up
 0   nvme  1.81940   1.00000  1.8 TiB  413 GiB  412 GiB   18 KiB  1.5 GiB  1.4 TiB  22.19  0.92   26      up
 1   nvme  1.81940   1.00000  1.8 TiB  553 GiB  550 GiB   27 KiB  2.1 GiB  1.3 TiB  29.66  1.23   35      up
 2   nvme  1.81940   1.00000  1.8 TiB  365 GiB  363 GiB   23 KiB  1.7 GiB  1.5 TiB  19.58  0.81   23      up
                       TOTAL   16 TiB  4.0 TiB  3.9 TiB  214 KiB   16 GiB   12 TiB  24.21                   
MIN/MAX VAR: 0.81/1.23  STDDEV: 3.01


I'm fairly new to Ceph and, to be honest, I'm not entirely sure how to debug this. I was hoping someone could guide me along :)
 
Let me guess: the WD SSDs are consumer models? Probably not the fastest ones either?
Can you post the exact model numbers you are using?

As a rule of thumb, consumer SSDs don't perform well under server workloads because they are not designed to sustain their speed for prolonged periods. Usually consumer SSDs are plenty fast for as long as their pseudo-SLC cache lasts, and then the speed breaks down; in the case of QLC SSDs, to below the speed of hard drives.

Additionally, consumer SSDs don't have power-loss protection, which means they cannot cache any sync writes at all. This reduces their performance to abysmal levels under certain circumstances.
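If you want to see that effect for yourself, a quick sync-write test with fio usually makes it obvious. A rough sketch; the test file path is just an example, and the numbers vary a lot between drives:
Code:
# 4k sync writes at queue depth 1 - roughly the kind of small sync writes an OSD does.
# Consumer NVMe without PLP often collapses to a few hundred IOPS here.
fio --name=plp-test --filename=/root/fio-testfile --size=1G \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --fsync=1 --runtime=60 --time_based --group_reporting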

In your case Ceph is complaining that at least one of your OSDs is performing too slowly.

My guess is that the SSD model you chose simply isn't up to the task, but we will only be able to say more once we know exactly what you are using.
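To see what osd.8 is actually stuck on, you could dump its ops on whichever node hosts it, roughly like this (osd.8 taken from your health output):
Code:
# run on the node that hosts osd.8 (needs the local admin socket)
ceph daemon osd.8 dump_ops_in_flight   # ops currently blocked / in flight
ceph daemon osd.8 dump_historic_ops    # recently completed slow ops
ceph osd perf                          # commit/apply latency per OSD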
 
Yeah, you are correct. They are WD SN770s. Honestly, with this being a testing environment and all, we didn't consider this to be making such an impact. AFAIK these SN770s don't have any DRAM cache at all.
 
And I would try to restart osd.8.
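For example (a sketch; the restart has to happen on whichever node actually hosts osd.8):
Code:
ceph osd find 8                       # shows which host osd.8 lives on
systemctl restart ceph-osd@8.service  # run on that host
ceph -w                               # watch the cluster settle afterwards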
Restarting the OSDs certainly restored the throughput for now, though I fear that this will become a common occurrence.
I was hoping we could have some clustered storage without having to rely on an external SAN, but it seems I'm going to need better disks for that :)
 
Can you share a pvereport?
 
Your Ceph network is running over enp1s0, are you sure this is 10Gbit?

Just in case you don't know: never use size=2, min_size=2; you can't lose a single server with that setting.
And yes, as already stated, only use enterprise SSDs with power-loss protection.
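To see what your pool is actually set to (pool name "ceph" taken from your ceph df output), something like:
Code:
ceph osd pool get ceph size
ceph osd pool get ceph min_size
# recommended default is size=3, min_size=2, e.g.:
# ceph osd pool set ceph size 3
# ceph osd pool set ceph min_size 2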

According to your lspci, you're also using non-enterprise Ethernet adapters.
If you actually want to know what is causing your issue:

  • check the network (iperf/iperf3)
  • check Ceph (pool, OSDs, latency), all commands in the docs below; a rough sketch follows the link

https://www.thomas-krenn.com/de/wiki/Ceph_Perfomance_Guide_-_Sizing_&_Testing (autotranslate should help)
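Roughly what I mean (a sketch; the pool name "ceph" is taken from your ceph df output, the IP is a placeholder for the other node's address on the 10Gbit network):
Code:
# network: run the server on one node, the client on another
iperf3 -s                        # on node A
iperf3 -c <ip-of-node-A>         # on node B, should come out close to 10 Gbit/s

# per-OSD latency while the cluster is busy
ceph osd perf

# raw pool throughput (writes test objects into the pool, clean up afterwards)
rados bench -p ceph 30 write --no-cleanup
rados bench -p ceph 30 seq
rados -p ceph cleanup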
 
Last edited:
Your ceph network is running over enp1s0 which is only 1000 mbit according to ip a. Just in case you don't know: never use size=2, min_size=2; you can't lose a single server with that setting.

And yes, as already stated, only use enterprise SSDs with power-loss protection.
I ran into that in the past and then reconfigured it. I'm not sure why it's reporting 1000 Mbit/s on the host. I triple-checked it just now and my switch (Aruba) is reporting 10 Gbit/s, which also aligns with the data transfer I see over that NIC (close to 10 Gbit/s if I run a quick SCP test).

Anyhow, since I'm picking brains anyway: with this set of hardware, is there any good distributed storage solution, or is the answer simply to use enterprise disks?
 
Sorry, I misread that and edited my comment. Check above; I would go for osd perf and a benchmark (see link). If the disks are bad you will see it by benchmarking them and watching the latency while the benchmark runs (ceph osd perf).

You can check the link speed with ethtool enp1s0.
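For example (interface name taken from your post; the MTU checks are just an extra sanity check I would do, since you mentioned jumbo frames):
Code:
ethtool enp1s0 | grep -i speed       # should report Speed: 10000Mb/s
ip link show enp1s0                  # verify MTU 9000 is actually applied
ping -M do -s 8972 <other-node-ip>   # jumbo frames end to end (8972 + 28 bytes header = 9000)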
 
All right. I'm gonna do some testing to see what's up. Thank you so much for your advice so far :)
 
