[Solution] CEPH on ZFS

PanWaclaw

Hi, I had many problems installing a Ceph OSD on ZFS.

Here is a complete solution:

Step 1. (repeat on every machine)
Install Ceph - #pveceph install

Step 2. (run only on the main machine of the cluster)
Init Ceph - #pveceph init --network 10.0.0.0/24 -disable_cephx 1
10.0.0.0/24 - your local network
-disable_cephx 1 - disables authentication, which avoids many problems

Step 3. (repeat on every machine that uses Ceph)
Create Mon and Mgr
#pveceph create mon
#pveceph create mgr
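To confirm that the monitors and managers came up on every node, checking the cluster status should be enough (pveceph status is the Proxmox wrapper; plain ceph -s shows the same thing):
#pveceph status
#ceph -s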

Step 4. (repeat on every machine providing an OSD)
Create a simple zvol on the default rpool for the Ceph OSD to use.
#zfs create -V 100G rpool/ceph-osd
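The zvol appears as a /dev/zdN block device, and the number is not guaranteed, so it is worth checking which one it got (the /dev/zvol symlink path below is the standard one for this pool/volume name):
#ls -l /dev/zvol/rpool/ceph-osd
#ls /dev/zd*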

ZAP and CREATE OSD
#ceph-volume lvm zap /dev/zd0
#ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class HDD
#ceph-volume raw activate --device /dev/zd0 --no-tmpfs --no-systemd

Start the OSD and enable it to start after reboot:
#systemctl start ceph-osd@0
#systemctl enable ceph-osd@0

Fix disk permissions
#echo 'KERNEL=="zd0", OWNER="ceph", GROUP="ceph"' >> /etc/udev/rules.d/99-perm.rules
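To apply the rule without a reboot, reloading udev and re-triggering devices should be enough (assuming the rule file name used above):
#udevadm control --reload-rules
#udevadm trigger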
 
This is interesting indeed! Do you by any chance have any performance data?
 
If you are interested, I can collect some performance data :)

However, I didn't care much about that; I use HDDs connected over a 1 Gbit network.
Low performance - but it does its job perfectly!

HDD per host: 187.45 MB/s
LAN max: 100 MB/s

You can't expect much, but it works perfectly.
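If anyone wants to reproduce numbers like these, rados bench against a throw-away pool is the quickest check I know of (the pool name is just an example; deleting pools may require mon_allow_pool_delete=true):

Code:
ceph osd pool create bench-test 32
rados bench -p bench-test 60 write --no-cleanup   # 60 second write test
rados bench -p bench-test 60 seq                  # sequential read test on the objects left behind
rados -p bench-test cleanup
ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it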
 
I am adding this to my system right now to try out Ceph performance with ZFS as the underlying system, an enterprise SSD for the special/SLOG and cache devices, and the ZFS ARC, to see if it improves Ceph performance. I was wondering if you can see an improvement there.
 
A ZFS SLOG/ARC can offer higher performance than plain HDDs. Note, though, that Ceph does end-to-end replication and synchronization - that is the write bottleneck. The ARC does well, just not with larger files, and it doesn't really matter compared to the speed of the network: 10 Gbit/s = 1 GB/s, which a regular HDD array handles fine, so adding more SSDs will not raise the maximum network transfer.
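A quick way to confirm whether the network or the disks are the limit is to measure the raw link between two nodes first (iperf3 has to be installed on both ends; the hostname is an example):

Code:
# on the receiving node
iperf3 -s
# on the sending node
iperf3 -c node2 -t 30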
 
Updated the guide a bit:


Code:
Ceph on ZFS
Rationale:
Wanting to get the most out of my Samsung PM983 enterprise NVMes, and more speed in ceph, I wanted to
test ceph on top of a non-raidz ZFS to make use of the ARC, SLOG and L2ARC

Prerequisites:
Proxmox (or Debian)
Working ceph installation (MON, MGR)

/dev/sda = Spinning rust 12TB
/dev/sdb = Spinning rust 4TB
/dev/nvme0n1 = Samsung PM983 PLP 1TB



# Prepare the NVMe: special (metadata) 33 GB, SLOG 10 GB (not used when sync is disabled!) and cache (L2ARC)
parted /dev/nvme0n1
mklabel gpt
mkpart primary 0 33GB
- mkpart primary 33GB 43GB # OPTIONAL SLOG 10 GB (not used when sync is disabled!) - if created, start the next partition at 43GB instead of 33GB
mkpart primary 33GB 256GB # cache (L2ARC)

# Clean the spinning disks
sgdisk --zap-all /dev/sda
sgdisk --zap-all /dev/sdb

# Clean any remaining ZFSes
zpool list
zpool destroy TestZFS

# Create a single-disk ZFS pool named ZFS-ceph in the GUI (on /dev/sda)

# Add more disks in cmdline - advantage of non-raidz is that sizes do not have to match on the drives.
# But then you have no redundancy (we leave that to ceph!)
zpool add ZFS-ceph /dev/sdb

zpool add ZFS-ceph special /dev/nvme0n1p1
zpool add ZFS-ceph log /dev/nvme0n1p2 # OPTIONAL SLOG (not used when sync is disabled!)
zpool add ZFS-ceph cache /dev/nvme0n1p2 # use /dev/nvme0n1p3 here if you created the optional SLOG partition

zfs set sync=disabled ZFS-ceph
zfs set atime=off ZFS-ceph
zfs set checksum=off ZFS-ceph # Leave the checksumming/scrubbing to ceph
zfs set xattr=sa ZFS-ceph

# Create volume devices (zvols) from the pool. You should now get /dev/zd0, /dev/zd16 etc.
# They are all 2T to enable uniform distribution in ceph, but this might change, as the
# ceph OSDs chew up about 4 GB per device!
zfs create -V 2T ZFS-ceph/ceph-osd1 && ls /dev/zd* && zfs list
zfs create -V 2T ZFS-ceph/ceph-osd2 && ls /dev/zd* && zfs list
...


# ZAP and create the ceph OSDs on top of the zd* devices. MAKE NOTE of the assigned OSD number, as
# you will need those numbers to start them later
ls /dev/zd*  && ceph-volume lvm zap /dev/zd0
ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class hdd && ls /dev/zd* 
ceph-volume raw activate --device /dev/zd0 --no-tmpfs --no-systemd && ls /dev/zd* 

# Start the ceph OSD and enable it to start after reboot
systemctl start ceph-osd@0
systemctl enable ceph-osd@0

# Run once - Tell udev to run ceph-volume activate every time a /dev/zd* device comes online,
# otherwise the ceph OSDs will not start again after a reboot of the host, because the /dev/zd*
# device numbering can shift. Also ensure that zd* is owned by the ceph user.
echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", RUN+="/usr/sbin/ceph-volume raw activate --device /dev/%k --no-tmpfs --no-systemd"' >> /etc/udev/rules.d/99-perm.rules
echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", OWNER="ceph", GROUP="ceph"' >> /etc/udev/rules.d/99-perm.rules
# Run once - Turn off the write cache on all disks
echo 'ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", RUN+="/usr/sbin/smartctl -s wcache,off /dev/$kernel"' >> /etc/udev/rules.d/99-perm.rules


# DONE!
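
# Sanity check (my usual checks - adjust the pool name if yours differs):
# the new OSDs should appear in the tree and the cluster should reach HEALTH_OK once peering finishes
ceph osd tree
ceph -s
zpool status ZFS-ceph
zfs list -t volume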
 

How many Proxmox nodes do you have in the cluster doing this?
If you don't have 3+, how does this give you redundancy?
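For context, these are the settings I would look at to see what redundancy the pool actually provides (the pool name is just an example):

Code:
ceph osd pool get mypool size       # number of replicas
ceph osd pool get mypool min_size   # minimum replicas required to serve I/O
ceph osd tree                       # how the OSDs map onto hosts for CRUSH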
 
Hi there! I know this post is somewhat old, but I'm running into trouble.
I have three servers, each with two HDDs in RAID 1.
I want to create a Ceph OSD with a zvol on the main (mirrored) rpool.
Creating rpool/ceph-osd seems fine, but then apparently I can't create the OSD:
Code:
root@node1:~# ceph-volume lvm zap /dev/zd0
--> Zapping: /dev/zd0
--> --destroy was not specified, but zapping a whole device will remove the partition table
Running command: /usr/bin/dd if=/dev/zero of=/dev/zd0 bs=1M count=10 conv=fsync
 stderr: 10+0 records in
10+0 records out
 stderr: 10485760 bytes (10 MB, 10 MiB) copied, 1.26419 s, 8.3 MB/s
--> Zapping successful for: <Raw Device: /dev/zd0>
root@node1:~# ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class HDD
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new fa0829a5-c388-4afd-933b-345503b56353
 stderr: 2024-08-22T12:08:51.022+0200 7599a5c006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.022+0200 7599a5c006c0 -1 AuthRegistry(0x7599a0063ec8) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 AuthRegistry(0x7599a0063ec8) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 AuthRegistry(0x7599a0069020) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 AuthRegistry(0x7599a5bff3c0) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: [errno 2] RADOS object not found (error connecting to the cluster)
-->  RuntimeError: Unable to create a new OSD id

FYI: I'm using the web GUI to create and configure Ceph; I don't know if that changes anything. My three machines show up under Monitor and Manager.

Thanks!
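Judging from the errors above, the client.bootstrap-osd keyring simply isn't present on that node. My guess (untested here) is that exporting it from the cluster's auth database into the path ceph-volume expects would let it connect:

Code:
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
chown ceph:ceph /var/lib/ceph/bootstrap-osd/ceph.keyring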