Hello community
Sorry for a newbie post like this; I will probably get the recommendation not to go this way, but I am still curious about the cause of this issue.
First of all, it is a home lab: a single-node Proxmox cluster.
My son bought a second-hand Dell PowerEdge R730xd. The server has a PERC H730P Mini Embedded controller in HBA mode.
The server has 2x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz (16 physical cores, 32 threads) and 128 GB of memory.
There are several SSD disks:
3x Samsung 860 EVO in a ZFS pool, used as system disks for VMs
2x Intel 480 GB for some data drives mounted inside a VM
2x Kingston 960 GB, also for mounts
As I wanted to add drives one by one, but still be able to see them as one storage pool inside the VMs, I was thinking about BTRFS or Ceph. I don't like ZFS for this, as I would have to add drives in groups matching the initial vdev layout.
Then I tried to create CephFS with 2x 2TB Dahua C800 drives + 1x Kingston 960GB drive, and later added 1x Crucial CT2000BX 2TB drive (the idea was to remove the Kingston, as it only serves as temporary storage for copying data into CephFS).
This is my first experience with Ceph, so maybe I have made some mistakes.
I changed the CRUSH rule failure domain from host to OSD:
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type osd
    step emit
}
and have set:

osd_pool_default_size = 2
osd_pool_default_min_size = 2
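For reference, an equivalent rule can also be created from the CLI and applied to the pools, roughly like this (the pool name cephfs_data is just a placeholder, not necessarily mine):

    ceph osd crush rule create-replicated replicated_osd default osd   # failure domain = osd
    ceph osd pool set cephfs_data crush_rule replicated_osd            # assumed pool name
    ceph osd pool set cephfs_data size 2
    ceph osd pool set cephfs_data min_size 2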
I had one major issue with my new drive being faulty, so I replaced it. I also had an issue with the whole CephFS during the first setup, where I got permission denied for the keyring, config, etc., so I removed everything completely (I found a post describing how to do it) and created CephFS from scratch using the Proxmox GUI.
I am connecting to CephFS from one media-server VM with the kernel driver (not FUSE), and mounting at boot works fine as well.
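The fstab entry in the VM looks roughly like this (the monitor address, client name and secret file path are placeholders, not my real values):

    192.168.1.10:6789:/ /mnt/cephfs ceph name=media,secretfile=/etc/ceph/media.secret,_netdev,noatime 0 0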
I know those are consumer-grade drives, and I have only one mon, mgr, and mds, i.e. one node, so I was expecting slower transfers. But I am fighting a strange issue now:
Write operations are quite laggy. A copy starts at, say, 200 MiB/s for some time (like filling a cache), then goes down. But the problem is not that it goes down (even to around 30 MiB/s); it then stops completely, or sits at 1000 KiB/s, then 0, then 4 MiB/s, and so on. Health reports OSD delays at that time.
Even rebalancing goes this slowly. Reading is a lot better, of course, and seems to have no lags.
I have checked HBA mode and CPU utilization, which is quite low; memory is available.
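The resource checks were along these lines (a reconstruction, not my exact commands):

    top -bn1 | head    # CPU utilization stays low even during the stalls
    free -h            # plenty of memory available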
Not sure where else to look.
fio from the VM looks like this for 4k:

[fio output screenshot]
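The test was invoked with something like this (a typical 4k random-write run; my exact parameters may have differed, and /mnt/cephfs is an assumed mount point):

    fio --name=randwrite-4k --filename=/mnt/cephfs/fio.test --rw=randwrite --bs=4k --size=1G --runtime=60 --time_based --ioengine=libaio --iodepth=32 --direct=1 --group_reporting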
It is as if it is barely writing at all.
It is from the same time (I know rebalancing is in progress).

The system log has messages like this:
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.380+0100 780c652006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.381+0100 780c652006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.381+0100 780c652006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
Feb 28 12:56:30 ms ceph-crash[2365]: 2025-02-28T12:56:30.381+0100 780c652006c0 -1 monclient: keyring not found
But according to some posts, this is normal and can be ignored.
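From what I read, ceph-crash runs as the unprivileged ceph user, which cannot read /etc/pve/priv. The Ceph docs suggest giving it its own key, something like this (untested by me):

    ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash' > /etc/ceph/ceph.client.crash.keyring
    systemctl restart ceph-crash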
It was similar even when not rebalancing. (I added the 2TB drive and removed the 1TB Kingston; that is the reason for the rebalancing. The Kingston is out, but still up.)
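For the record, taking it out went roughly like this (OSD id 3 is a placeholder, not necessarily the real id):

    ceph osd out 3                             # stop placing data on it; backfill begins
    # once backfill finishes and removal is safe:
    systemctl stop ceph-osd@3
    ceph osd purge 3 --yes-i-really-mean-it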
In the Ceph log there is nothing special, just scrolling info about recovery, but at those slow speeds:
  services:
    mon: 1 daemons, quorum ms (age 2d)
    mgr: ms(active, since 2d)
    mds: 1/1 daemons up
    osd: 4 osds: 4 up (since 13h), 3 in (since 13h); 50 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 137 pgs
    objects: 452.84k objects, 1.7 TiB
    usage:   3.0 TiB used, 2.5 TiB / 5.5 TiB avail
    pgs:     176632/905684 objects misplaced (19.503%)
             87 active+clean
             49 active+remapped+backfill_wait
             1  active+remapped+backfilling

  io:
    recovery: 1.3 MiB/s, 0 objects/s
Do you have any tips or an idea of what is going on here?