Proxmox + Ceph: VMs won't start + dreadful performance

justSem

New Member
May 10, 2024
I have been searching far and wide for this, but I can't seem to find a solution.
In one of our testing clusters we're experimenting with Ceph, but so far the journey has been bumpy at best.

The actual trigger for this post is that one of our VMs locked up and I can't get it to reboot.
qm just times out and I can't seem to get any useful logging out of the cluster or out of Ceph.

To outline our setup:
We're running 3 nodes, each with a Ryzen 7950X and 128 GB of RAM.
All nodes have both a 1 Gbit/s and a 10 Gbit/s NIC; the latter is used for Ceph and migration traffic. The 10 Gbit network has an MTU of 9000.
All storage is NVMe (PCIe 4.0). Each node contains 3x 2 TB NVMe drives (WD).

ceph health detail
Code:
HEALTH_WARN Reduced data availability: 1 pg inactive; 256 slow ops, oldest one blocked for 607418 sec, osd.8 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 1.0 is stuck inactive for 7d, current state unknown, last acting []
[WRN] SLOW_OPS: 256 slow ops, oldest one blocked for 607418 sec, osd.8 has slow ops

ceph -s
Code:
root@pve001:~# ceph -s
  cluster:
    id:     45c6e495-0fd6-48fe-8df9-90018537a237
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            256 slow ops, oldest one blocked for 607439 sec, osd.8 has slow ops
 
  services:
    mon: 3 daemons, quorum pve001,pve002,pve003 (age 7d)
    mgr: pve002(active, since 7d), standbys: pve001, pve003
    osd: 9 osds: 9 up (since 7d), 9 in (since 5w)
 
  data:
    pools:   2 pools, 129 pgs
    objects: 523.53k objects, 2.0 TiB
    usage:   4.0 TiB used, 12 TiB / 16 TiB avail
    pgs:     0.775% pgs unknown
             128 active+clean
             1   unknown

ceph df
Code:
root@pve001:~# ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
nvme   16 TiB  12 TiB  4.0 TiB   4.0 TiB      24.21
TOTAL  16 TiB  12 TiB  4.0 TiB   4.0 TiB      24.21
 
--- POOLS ---
POOL  ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr   1    1      0 B        0      0 B      0    5.3 TiB
ceph   4  128  2.0 TiB  523.53k  3.9 TiB  26.96    5.3 TiB

ceph osd df
Code:
root@pve001:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 6   nvme  1.81940   1.00000  1.8 TiB  397 GiB  395 GiB   21 KiB  2.2 GiB  1.4 TiB  21.32  0.88   25      up
 7   nvme  1.81940   1.00000  1.8 TiB  522 GiB  521 GiB   21 KiB  1.7 GiB  1.3 TiB  28.04  1.16   33      up
 8   nvme  1.81940   1.00000  1.8 TiB  430 GiB  428 GiB   28 KiB  1.6 GiB  1.4 TiB  23.07  0.95   27      up
 3   nvme  1.81940   1.00000  1.8 TiB  475 GiB  473 GiB   22 KiB  1.4 GiB  1.4 TiB  25.48  1.05   30      up
 4   nvme  1.81940   1.00000  1.8 TiB  444 GiB  442 GiB   27 KiB  1.7 GiB  1.4 TiB  23.84  0.98   28      up
 5   nvme  1.81940   1.00000  1.8 TiB  461 GiB  459 GiB   22 KiB  1.7 GiB  1.4 TiB  24.72  1.02   29      up
 0   nvme  1.81940   1.00000  1.8 TiB  413 GiB  412 GiB   18 KiB  1.5 GiB  1.4 TiB  22.19  0.92   26      up
 1   nvme  1.81940   1.00000  1.8 TiB  553 GiB  550 GiB   27 KiB  2.1 GiB  1.3 TiB  29.66  1.23   35      up
 2   nvme  1.81940   1.00000  1.8 TiB  365 GiB  363 GiB   23 KiB  1.7 GiB  1.5 TiB  19.58  0.81   23      up
                       TOTAL   16 TiB  4.0 TiB  3.9 TiB  214 KiB   16 GiB   12 TiB  24.21                   
MIN/MAX VAR: 0.81/1.23  STDDEV: 3.01


I'm fairly new to Ceph and, to be honest, I'm not entirely sure how to debug this. I was hoping someone could guide me along :)
 
Let me guess: the WD SSDs are consumer models? Probably not the fastest ones either?
Can you post the exact model numbers you are using?

As a rule of thumb, consumer SSDs don't perform well under server workloads because they are not designed to sustain their speed for prolonged periods. Usually consumer SSDs are plenty fast for as long as their pseudo-SLC cache lasts, and then the speed breaks down; in the case of QLC SSDs, to below the speed of hard drives.

Additionally, consumer SSDs don't have power-loss protection, which means they cannot cache any sync writes at all. This reduces their performance to abysmal levels under certain circumstances.
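If you want to see that effect for yourself, a quick sync-write test with fio usually makes it obvious. A rough sketch; the test file path is just an example, and the numbers vary a lot between drives:
Code:
# 4k sync writes at queue depth 1 - roughly the kind of small sync writes an OSD does.
# Consumer NVMe without PLP often collapses to a few hundred IOPS here.
fio --name=plp-test --filename=/root/fio-testfile --size=1G \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --fsync=1 --runtime=60 --time_based --group_reporting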

In your case Ceph is complaining that at least one of your OSDs is performing too slowly.

My guess is that the SSD model you chose simply isn't up to the task, but we will only be able to say more once we know exactly what you are using.
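To see what osd.8 is actually stuck on, you could dump its ops on whichever node hosts it, roughly like this (osd.8 taken from your health output):
Code:
# run on the node that hosts osd.8 (needs the local admin socket)
ceph daemon osd.8 dump_ops_in_flight   # ops currently blocked / in flight
ceph daemon osd.8 dump_historic_ops    # recently completed slow ops
ceph osd perf                          # commit/apply latency per OSD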
 
Yeah, you are correct. They are WD SN770s. Honestly, with this being a testing environment and all, we didn't consider this to be making such an impact. AFAIK these SN770s don't have any DRAM cache at all.
 
And I would try to restart osd.8.
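For example (a sketch; the restart has to happen on whichever node actually hosts osd.8):
Code:
ceph osd find 8                       # shows which host osd.8 lives on
systemctl restart ceph-osd@8.service  # run on that host
ceph -w                               # watch the cluster settle afterwards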
Restarting the OSDs certainly restored the throughput for now, though I fear that this will become a common occurrence.
I was hoping we could have some clustered storage without having to rely on an external SAN, but it seems I'm going to need better disks for that :)
 
Can you share a pvereport?
 
Your Ceph network is running over enp1s0, are you sure this is 10Gbit?

Just in case you don't know: never use size=2, min_size=2; you can't lose a single server with that setting.
And yes, as already stated, only use enterprise SSDs with power-loss protection.
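To see what your pool is actually set to (pool name "ceph" taken from your ceph df output), something like:
Code:
ceph osd pool get ceph size
ceph osd pool get ceph min_size
# recommended default is size=3, min_size=2, e.g.:
# ceph osd pool set ceph size 3
# ceph osd pool set ceph min_size 2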

According to your lspci, you're also using non-enterprise Ethernet adapters.
If you actually want to know what is causing your issue:

  • check the network (iperf/iperf3)
  • check Ceph (pool, OSDs, latency), all commands in the docs below; a rough sketch follows the link

https://www.thomas-krenn.com/de/wiki/Ceph_Perfomance_Guide_-_Sizing_&_Testing (autotranslate should help)
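Roughly what I mean (a sketch; the pool name "ceph" is taken from your ceph df output, the IP is a placeholder for the other node's address on the 10Gbit network):
Code:
# network: run the server on one node, the client on another
iperf3 -s                        # on node A
iperf3 -c <ip-of-node-A>         # on node B, should come out close to 10 Gbit/s

# per-OSD latency while the cluster is busy
ceph osd perf

# raw pool throughput (writes test objects into the pool, clean up afterwards)
rados bench -p ceph 30 write --no-cleanup
rados bench -p ceph 30 seq
rados -p ceph cleanup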
 
Last edited:
Your ceph network is running over enp1s0 which is only 1000 mbit according to ip a. Just in case you don't know: never use size=2, min_size=2; you can't lose a single server with that setting.

And yes, as already stated, only use enterprise SSDs with power-loss protection.
I ran into that in the past and then reconfigured it. I'm not sure why it's reporting 1000 Mbit/s on the host. I triple-checked it just now and my switch (Aruba) is reporting 10 Gbit/s, which also aligns with the data transfer I see over that NIC (close to 10 Gbit/s if I run a quick SCP test).

Anyhow, since I'm picking brains anyway: with this set of hardware, is there any good distributed storage solution, or is the answer simply to use enterprise disks?
 
Sorry, I misread that and edited my comment. Check above; I would go for osd perf and a benchmark (see link). If the disks are bad you will see it by benchmarking them and watching the latency while the benchmark runs (ceph osd perf).

You can check the link speed with ethtool enp1s0.
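For example (interface name taken from your post; the MTU checks are just an extra sanity check I would do, since you mentioned jumbo frames):
Code:
ethtool enp1s0 | grep -i speed       # should report Speed: 10000Mb/s
ip link show enp1s0                  # verify MTU 9000 is actually applied
ping -M do -s 8972 <other-node-ip>   # jumbo frames end to end (8972 + 28 bytes header = 9000)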
 
All right. I'm gonna do some testing to see what's up. Thank you so much for your advice so far :)
 
