Hi all,
I'm running into what seems to be a network bottleneck.
Context: one of my clients wants to revamp its infrastructure and was already happy with PVE servers despite having only local ZFS-backed images, missing out on the broad possibilities offered by Ceph... I wanted to push him to go for Ceph, with a CephFS layer and mount points in the containers to share data across horizontally scaled backends.
I already run such setups without a hitch in our own racks, with a 10G bonded loop between machines that use SSD-journaled magnetic drives; we don't have a huge payload to manage, but we've never suffered from it.
Due to the limited availability of the desired servers at OVH, and a bit scared by the DC fire a while back, the client ordered two (of the future three) nodes in distant locations (northern and eastern France)... and then it hit me: "Oh cr*p, we're going to have latency issues with Ceph." The average ping is 11 ms (the private network bandwidth is 12 Gbps).
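A quick way to put a number on that is single-threaded writes against a throwaway pool, so every op pays the full round trip (the pool below is just a scratch pool created for the test):
Code:
ping -c 10 <other-node-private-ip>                    # ~11 ms average here
ceph osd pool create latencytest 32                   # throwaway pool for the test
rados bench -p latencytest 10 write -b 4096 -t 1      # one op in flight: avg latency ~ RTT + replication + flush
ceph osd pool delete latencytest latencytest --yes-i-really-really-mean-it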
Both are beefy AMD EPYC based servers with small system SSDs and 4 enterprise-grade NVMe drives (Samsung PM983 1.92 TB) fully dedicated to being OSDs.
Well, I went ahead and benched the setup a bit, and it didn't turn out to be that ugly.
Code:
ONE CEPH NODE (crush map tweaked to chooseleaf type osd to stop it complaining):
local ceph rbd:
root@web-back-01:~/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
write: IOPS=48.0k, BW=188MiB/s (197MB/s)(12.0GiB/65501msec); 0 zone resets
Run status group 0 (all jobs):
WRITE: bw=188MiB/s (197MB/s), 188MiB/s-188MiB/s (197MB/s-197MB/s), io=12.0GiB (12.9GB), run=65501-65501msec
root@web-back-01:~/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
16x : write: IOPS=2630, BW=164MiB/s (172MB/s)(9984MiB/60734msec); 0 zone resets
Run status group 0 (all jobs):
WRITE: bw=2628MiB/s (2755MB/s), 160MiB/s-170MiB/s (168MB/s-178MB/s), io=156GiB (168GB), run=60192-60824msec
cephfs mount point:
root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
write: IOPS=71.5k, BW=279MiB/s (293MB/s)(16.0GiB/62300msec); 0 zone resets
Run status group 0 (all jobs):
WRITE: bw=279MiB/s (293MB/s), 279MiB/s-279MiB/s (293MB/s-293MB/s), io=16.0GiB (18.2GB), run=62300-62300msec
root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
write: IOPS=8653, BW=541MiB/s (567MB/s)(32.3GiB/61060msec); 0 zone resets
Run status group 0 (all jobs):
WRITE: bw=8727MiB/s (9151MB/s), 507MiB/s-594MiB/s (532MB/s-623MB/s), io=520GiB (559GB), run=60755-61062msec
TWO CEPH NODES (once the second node was ordered, crush map set back to the original), cephfs:
root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
write: IOPS=60.8k, BW=238MiB/s (249MB/s)(14.4GiB/61919msec); 0 zone resets
Run status group 0 (all jobs):
WRITE: bw=238MiB/s (249MB/s), 238MiB/s-238MiB/s (249MB/s-249MB/s), io=14.4GiB (15.4GB), run=61919-61919msec
root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
16x write: IOPS=10.2k, BW=639MiB/s (670MB/s)(39.5GiB/63339msec); 0 zone resets
Run status group 0 (all jobs):
WRITE: bw=10.8GiB/s (11.6GB/s), 566MiB/s-808MiB/s (593MB/s-847MB/s), io=686GiB (736GB), run=62542-63350msec
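In hindsight, those runs are buffered writes with a single fsync at the end, so they mostly show what the page cache can absorb. Something closer to the real payload would force a sync on every write, roughly like this (just the shape of the test, not something I've run on this cluster yet):
Code:
fio --name=sync-write --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --numjobs=1 --iodepth=1 --fsync=1 --size=1g --runtime=60 --time_based
If every write has to wait on the remote replica, that alone caps it around 1 s / 11 ms ≈ 90 IOPS per thread, no matter how fast the NVMe is.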
Issue is:
Once the real payload was deployed, performance turned out to be excruciatingly painful...
npm installs / composer take ages, whether on cephfs or directly within the containers' RBD disks, and overall the web payload is just plain slow; even MariaDB is abnormally slow.
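npm/composer installs are mostly thousands of tiny file creates, and MariaDB fsyncs on every commit, so I suspect per-operation latency rather than throughput is the culprit. A crude check on the cephfs mount (each create is typically a synchronous round trip to the MDS, which may well sit on the remote node):
Code:
cd /shares/web/tmp
# ~1000 small file creates, roughly what a package manager does
time bash -c 'for i in $(seq 1 1000); do echo data > file_$i; done; sync'
rm -f file_*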
Pools are all size/min_size 2/1 (there's a third, smaller dev node in the cluster acting as a third mon for safety), with 64 PGs for images and 64 PGs for cephfs
(it will be 3/2 once the third node is ordered).
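For reference, how I check them and what I plan to run once the third node is in ('images' being my rbd pool):
Code:
ceph osd pool ls detail              # shows size/min_size and pg_num per pool
# once the third node has joined:
ceph osd pool set images size 3
ceph osd pool set images min_size 2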
Is there a way to localize Ceph client <-> RBD interactions to the local host (and its OSDs) and let Ceph sync the data asynchronously to the remote node(s), without waiting for every receiving OSD to ack the write?
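The closest thing I've found so far is the read-from-replica policy, but if I read the docs right it only localizes reads; replicated writes still have to be acked by every OSD in the acting set, so I don't see how to make those asynchronous. Roughly (option names as I understand them, please correct me if I'm off):
Code:
# in ceph.conf on each PVE host, so "localize" knows which replica is nearby:
# [client]
# crush_location = host=web-back-01
ceph config set client rbd_read_from_replica_policy localize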
Willing to try different solutions... The goal is good-enough replication with a shared cephfs, with subfolders mounted as mount points in various backend containers (logs are centralized elsewhere).
I also set the global rbd_cache to true, but that seems moot since rbd_cache is enabled by default anyway.
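For reference, this is roughly how I set and checked it (from memory, so the syntax may be slightly off), and since the container volumes are mapped with krbd rather than librbd I assume the setting wouldn't apply to them anyway:
Code:
ceph config set global rbd_cache true
ceph config get client rbd_cache     # reports true, but it defaults to true anyway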
Thanks