Strange CEPH behaviour - lagging

khostri

New Member
Feb 28, 2025
Hello community,
Sorry for a newbie post like this; I will probably get the recommendation not to go this way at all, but I am still curious about the cause of this issue.

First of all, it's a home lab, a single-node Proxmox cluster.
My son bought a second-hand Dell PowerEdge R730xd. The server has a PERC H730P Mini embedded controller in HBA mode.
The server has 2x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz (16 physical cores, 32 threads) and 128 GB of memory.

There are several SSD disks:
3x Samsung 860 EVO in a ZFS pool used as system disks for VMs
2x Intel 480GB for data drives mounted inside VMs
2x Kingston 960GB, also for mounts

As I want to add drives one by one, but still be able to see them as one storage inside the VMs, I was thinking about BTRFS or Ceph. I don't like that ZFS would require me to add drives in the initial group size.
Then I tried to create a CephFS with
2x 2TB Dahua C800 drives + 1x Kingston 960GB drive, and later added 1x Crucial CT2000BX drive (the idea was to remove the Kingston afterwards, as it only serves as temporary storage while copying data into CephFS).
This is my first experience with Ceph, so maybe I have made some mistakes.
I changed the rule's failure domain from host to OSD:

rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type osd
    step emit
}

and I have:

osd_pool_default_min_size = 2
osd_pool_default_size = 2
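For reference, the same change can be made with the Ceph CLI instead of editing the decompiled CRUSH map by hand (the rule name and pool names below are assumptions; adjust to your setup):

```shell
# Create a replicated rule whose failure domain is "osd" instead of "host"
ceph osd crush rule create-replicated replicated_osd default osd

# Point the pools at the new rule (pool names are examples)
ceph osd pool set cephfs_data crush_rule replicated_osd
ceph osd pool set cephfs_metadata crush_rule replicated_osd

# Set the 2/2 replication defaults in the monitor config database
ceph config set global osd_pool_default_size 2
ceph config set global osd_pool_default_min_size 2
```

Note that the `osd_pool_default_*` values only affect newly created pools; existing pools keep their own `size`/`min_size` until changed with `ceph osd pool set`.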

I had one major issue with my new drive being faulty, so I replaced it. I also had an issue with the whole CephFS during the first setup, where I got "permission denied" for the keyring, config, etc. So I removed everything completely (I found a post describing how to do it) and created the CephFS from scratch using the Proxmox GUI.
I am connecting to the CephFS from one media-server VM with the kernel driver (not FUSE), and mounting works fine during boot as well.
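For context, a kernel-client CephFS mount like that typically looks something like this (the monitor address, client name and secret-file path below are assumptions, not my actual values):

```shell
# One-off mount of CephFS via the kernel client
mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
    -o name=media,secretfile=/etc/ceph/media.secret

# Equivalent /etc/fstab line for mounting at boot; _netdev delays the
# mount until the network is up
# 192.168.1.10:6789:/  /mnt/cephfs  ceph  name=media,secretfile=/etc/ceph/media.secret,_netdev  0  0
```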

I know those are consumer-grade drives, and I have only one mon, mgr and mds, i.e. one node. So I was expecting slower transfers. But I am fighting a strange issue now:

Write operations are quite laggy. A copy starts at, for example, 200 MiB/s for some time (like filling a cache), then goes down. But the problem is not that it goes down (even to something like 30 MiB/s), but that it then stops completely, or stays at 1000 KiB/s, then 0, then 4 MiB/s, etc. Ceph health reports OSD delays at that time.
Even rebalancing goes this slow. Reading is a lot better, of course, and seems to have no lags.
I have checked HBA mode, checked CPU utilization (which is quite low), and memory is available.
Not sure where to look.
fio from the VM looks like this for 4k:

[attachment: fio 4k benchmark screenshot]

It's like it's nearly not writing at all.
This is from the same time (I know rebalancing is in progress):

[attachment: screenshot]
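For anyone wanting to reproduce the numbers, a 4k random-write fio run against the CephFS mount would look roughly like this (the file path and size are examples):

```shell
# 4k random writes, direct I/O to bypass the page cache, 60 s steady-state
fio --name=4k-randwrite \
    --filename=/mnt/cephfs/fiotest \
    --rw=randwrite --bs=4k --size=1G \
    --ioengine=libaio --iodepth=16 --direct=1 \
    --runtime=60 --time_based --group_reporting
```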

The system log has messages like this:
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.380+0100 780c652006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.381+0100 780c652006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.381+0100 780c652006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.381+0100 780c652006c0 -1 monclient: keyring not found

But according to some posts this is normal and can be ignored.
It was similar even when not rebalancing. (I have added the 2TB drive and removed the 1TB Kingston; that is the reason for the rebalancing. The Kingston is out, but still up.)
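For what it's worth, that ceph-crash noise usually just means the crash daemon has no key of its own; the fix suggested in the Ceph documentation is to create a dedicated client.crash key (the keyring path follows the standard Ceph layout and may differ on Proxmox):

```shell
# Give ceph-crash its own credentials instead of the admin keyring
ceph auth get-or-create client.crash \
    mon 'profile crash' mgr 'profile crash' \
    > /etc/ceph/ceph.client.crash.keyring

systemctl restart ceph-crash.service
```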
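For the record, the usual sequence for draining and then removing an OSD like that Kingston ("out, but up") is roughly the following (the OSD id 3 is an assumption):

```shell
ceph osd out 3        # stop placing new data on it; triggers rebalancing
ceph -s               # wait until all PGs are active+clean again

systemctl stop ceph-osd@3
ceph osd purge 3 --yes-i-really-mean-it   # removes it from CRUSH map and auth
```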

In the Ceph log there is nothing special, just scrolling info about the recovery, but at those slow speeds:

services:
mon: 1 daemons, quorum ms (age 2d)
mgr: ms(active, since 2d)
mds: 1/1 daemons up
osd: 4 osds: 4 up (since 13h), 3 in (since 13h); 50 remapped pgs

data:
volumes: 1/1 healthy
pools: 3 pools, 137 pgs
objects: 452.84k objects, 1.7 TiB
usage: 3.0 TiB used, 2.5 TiB / 5.5 TiB avail
pgs: 176632/905684 objects misplaced (19.503%)
87 active+clean
49 active+remapped+backfill_wait
1 active+remapped+backfilling

io:
recovery: 1.3 MiB/s, 0 objects/s


Do you have any tips or an idea of what is going on there?
 
Thanks for the reply. As I have stated, I am using ZFS for the normal system disks. I was searching for something that lets me add disks one by one (without having a separate NAS box). This space is intended for media/data from a docker stack (not the compose files and app data itself, only the container data).

I like ZFS, I just had a problem with its behaviour of forcing me to add more disks in the initial group size. I.e., starting with 3 disks would force me to add another 3 disks as a batch; I cannot add only one more. That was the main reason.
For that reason I was also checking btrfs, which can convert RAID levels and add disks; I just noticed several issues on the forums with that setup.
If you have any other tip, that would be very nice :)

Regarding the CephFS issue: last night I emptied both Intel drives and added them as a secondary pool, cephfs-test. I set their device class to ssdt, to be different from the current drives (ssd), and updated the CRUSH map to separate both pools completely. Today I benchmarked the test pool from Linux and even from Windows, and it is a totally different situation: writes are around 500 MiB/s / 380 MiB/s sequential (CrystalDiskMark + fio), random is around 25 MiB/s, and there is no lag.
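The class separation described above corresponds to roughly these commands (the OSD ids, rule name and pool name are assumptions; adjust to your cluster):

```shell
# A class must be removed before a new one can be set
ceph osd crush rm-device-class osd.4 osd.5
ceph osd crush set-device-class ssdt osd.4 osd.5

# A replicated rule restricted to the "ssdt" class, failure domain osd
ceph osd crush rule create-replicated replicated_ssdt default osd ssdt

# Bind the test pool to the new rule
ceph osd pool set cephfs-test_data crush_rule replicated_ssdt
```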

So it seems that even though I have bought those 3 new 2TB drives, they are really useless, the Dahuas and even the Crucial. So I am thinking whether an x16 adapter for 4x M.2 NVMe would be better. I will probably be able to return both Dahuas and sell the Crucial.
But still the question about filesystem remains :)
 
I like ZFS, I just had a problem with its behaviour of forcing me to add more disks in the initial group size. I.e., starting with 3 disks would force me to add another 3 disks as a batch
This has never been true. It was (and is) recommended, but not required.

and I cannot add only one more.
This was (technically) also never true. You have been able to add a single new vdev with a single disk since ZFS was invented. But if you actually did that, you would lose all the redundancy the other vdevs had established, so it was practically forbidden. And since you couldn't "undo" adding that single drive, it was (and is, if you have RaidZ involved!) dangerous to "just test it".

With ZFS 2.3 - which is not available in PVE yet - there is "RaidZ expansion", giving you more options in the future.
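To illustrate both points (pool name "tank" and device path are assumptions):

```shell
# Adding a single-disk vdev to a redundant pool has always been possible,
# but zpool refuses the mismatched replication level unless forced:
zpool add tank /dev/sdx        # fails with a mismatched-replication error
zpool add -f tank /dev/sdx     # works, but that disk has NO redundancy
                               # and cannot be removed from a RaidZ pool

# With OpenZFS 2.3+ (RaidZ expansion), a single disk can instead be
# attached to an existing RaidZ vdev, keeping its redundancy:
zpool attach tank raidz1-0 /dev/sdx
```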
 
If you care about your data, buy second-hand enterprise drives instead of consumer ones. The performance you are seeing is expected with those drives: once the drive's SLC cache is full, writes are very slow.

Also, I'm not sure the overhead/complexity introduced by Ceph in your single-node setup is really worth it unless you plan to expand that cluster to 3+ nodes soon. Also, keep in mind that even though Ceph allows you to add single disks, you should add them in same-size pairs to help Ceph distribute data in that 2/2 pool. I would not use that setup for anything but practice in a lab.
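Adding disks one at a time is indeed simple on the Proxmox side; a sketch, assuming the example device paths below:

```shell
# Wipe and add two same-size disks as OSDs (device paths are examples)
pveceph osd create /dev/sdg
pveceph osd create /dev/sdh
```

Ceph will then rebalance existing PGs onto the new OSDs automatically.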

I would stick with ZFS or even LVM and good backups. Or just buy big enough disks! ;)
 
This has never been true. It was (and is) recommended, but not required.


This was (technically) also never true. You have been able to add a single new vdev with a single disk since ZFS was invented. But if you actually did that, you would lose all the redundancy the other vdevs had established, so it was practically forbidden. And since you couldn't "undo" adding that single drive, it was (and is, if you have RaidZ involved!) dangerous to "just test it".

With ZFS 2.3 - which is not available in PVE yet - there is "RaidZ expansion", giving you more options in the future.
Thanks. I am looking forward to it being available in Proxmox :) For now I have destroyed the Ceph setup and removed it.
My son's server is a Dell R730xd with 24x 2.5'' bays (SATA backplane), so originally I was searching for a solution to add 2.5'' SATA disks; I didn't want to change the backplane and buy expensive Dell converters. But I will probably order an Asus PCIe 4x M.2 NVMe adapter for the x16 slot (there is already a graphics card in the second slot, for Tdarr and Jellyfin) and buy 4x 2TB SSDs to try it with ZFS. I searched for cheap drives with DRAM cache and found that the Apacer AS2280Q4 2TB should have 1GB of DRAM cache and reasonable speed, so I will probably go that way.
 