new install with Ceph: 4 nodes, clock skew always detected on 2 mons, constant "active+clean+remapped" status for 5 pgs, poor performance

starkruzr

Well-Known Member
well, I "solved" my cephx problem (https://forum.proxmox.com/threads/new-install-cannot-create-ceph-osds-bc-of-keyring-error.119375/) by disabling cephx

now, however, the Ceph dashboard constantly complains about clock skew (e.g. mon.ganges clock skew 0.131502s > max 0.05s (latency 0.00491829s), mon.orinoco clock skew 0.272502s > max 0.05s (latency 0.00142334s)). that is with Chrony on all 4 hosts syncing to a VM on one host running htpdate (outgoing NTP is blocked on my network). as far as I know there isn't anything "magic" about syncing against a VM that should break anything.
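if it helps, these are the sorts of checks that should show where the drift is coming from (this assumes chrony is the only time daemon running on each host):

Code:
# how far off does chrony think the local clock is? run on each host
chronyc tracking
# which time sources has chrony selected, and are they reachable?
chronyc sources -v
# from any node: the skew the monitors themselves are measuring
ceph time-sync-status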

I also have 5 pgs apparently permanently in the "active+clean+remapped" state. `ceph pg repair` does nothing to fix them.
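from what I understand, `ceph pg repair` targets scrub inconsistencies rather than remapping, so the next step is probably to look at why CRUSH keeps those pgs remapped (the pg id below is just a placeholder):

Code:
# list only the remapped pgs and the osds they are currently mapped to
ceph pg ls remapped
# inspect one of them (replace 1.2f with a real pg id from the list above)
ceph pg 1.2f query
# make sure the pools' size and crush rules can actually be satisfied
ceph osd pool ls detail
ceph osd df tree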

perhaps as a result, I get terrible benchmark results:

Code:
root@ganges:/var/log# rados bench -p fastwrx 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ganges_41218
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       135       119   475.976       476   0.0429534    0.128449
    2      16       266       250   499.965       524   0.0186067    0.119404
    3      16       380       364   485.296       456   0.0263719    0.126233
    4      16       503       487   486.961       492    0.284686    0.128938
    5      16       625       609   487.161       488    0.119083    0.129185
    6      16       732       716   477.295       428    0.261727    0.131228
    7      16       862       846   483.391       520   0.0398449     0.12845
    8      16       980       964   481.962       472   0.0338546    0.130552
    9      16      1109      1093   485.739       516   0.0250924    0.129385
   10      16      1256      1240   495.961       588   0.0500059    0.128721
Total time run:         10.1216
Total writes made:      1256
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     496.364
Stddev Bandwidth:       44.1009
Max bandwidth (MB/sec): 588
Min bandwidth (MB/sec): 428
Average IOPS:           124
Stddev IOPS:            11.0252
Max IOPS:               147
Min IOPS:               107
Average Latency(s):     0.128113
Stddev Latency(s):      0.101359
Max latency(s):         0.556049
Min latency(s):         0.0133952
Cleaning up (deleting benchmark objects)
Removed 1256 objects
Clean up completed and total clean up time :0.542294
root@ganges:/var/log#

before reimaging, this cluster had 3 nodes instead of 4, and the 4th node has added an additional NVMe device to the pool, bringing it to 10 NVMe devices total. this is all NVMe over 10G networking. my sense is that I should be getting much better numbers than this; I could swear to God it was able to do IOPS well into the thousands and over 1Gbps of sequential writes before.
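to rule out the 10G links themselves, an iperf3 run between two of the nodes should show whether raw network throughput is where it ought to be (the hostname below is just an example target):

Code:
# on one node: start an iperf3 server
iperf3 -s
# on another node: push traffic over the ceph network for 10s with 4 parallel streams
# (replace ganges with the target node's ceph-network address)
iperf3 -c ganges -t 10 -P 4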

any pointers where to start? TIA.
 
Make a benchmark with rados bench and also check latency while it is running with: watch -n.1 ceph osd perf. This will show you the latency of each OSD; you might find one with high latency that is dragging down your overall performance.

You can also check the performance of each single OSD with the following command: ceph tell osd.* bench. This will show you if you have a badly performing drive.

Why are you messing around with non-standard Proxmox Ceph installs? I'm referring to your previous post: you don't need multiple OSD daemons per NVMe, especially with only 10Gbit networking. Going by your previous post, it is also not recommended to run with size 2 and min_size 2, since that means you cannot lose a single server. I would recommend reinstalling and using the Proxmox UI to set up Ceph with a 3:2 setup, or a 4:2 setup since you have 4 nodes.
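In command form, the two checks above would be roughly (run from any node in the cluster while a rados bench is going):

Code:
# watch per-osd latency while a rados bench runs elsewhere
watch -n.1 ceph osd perf
# benchmark every osd individually (by default each writes 1 GiB in 4 MiB blocks)
ceph tell osd.* bench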
 
that benchmark tip is great, thank you - surprisingly, the Flash devices I thought were the fastest actually had the highest apply/commit latencies. now at least I know the culprits to replace.

not sure what you mean by non-standard. multiple OSDs per 2TB NVMe seems to be standard practice; I've done that since I started using Ceph on the advice of Anthony D'Atri from the Ceph mailing lists. I'm also not on 2/2 anymore despite what it says there about the default; all my pools are 3/2 now.
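for what it's worth, an easy way to double-check what each pool is actually set to (using my fastwrx pool as the example):

Code:
# size / min_size, crush rule and pg count for every pool
ceph osd pool ls detail
# or ask a single pool directly
ceph osd pool get fastwrx size
ceph osd pool get fastwrx min_size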
 

Nice! I looked at the Proxmox Ceph benchmarking report and they did not really see any benefit from multiple OSD daemons per NVMe. Do you have your own comparisons, especially on your 10G network?
 
