[Solution] CEPH on ZFS

PanWaclaw

Active Member
Jan 30, 2017
14
19
43
38
Hi, I had a lot of problems getting Ceph OSDs to run on ZFS.

Here is a complete solution that worked for me:

Step 1. (repeat on every machine)
Install Ceph: #pveceph install
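You can quickly confirm the packages are in place before going on:
#ceph --version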

Step 2. (run only on the main machine of the cluster)
Initialize Ceph: #pveceph init --network 10.0.0.0/24 -disable_cephx 1
10.0.0.0/24 - your local network
-disable_cephx 1 - disables authentication, which avoids a lot of problems
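If you want to sanity-check what pveceph init wrote, have a look at /etc/pve/ceph.conf. Roughly what to expect in the [global] section with cephx disabled (only a sketch, the exact keys differ between versions):
#cat /etc/pve/ceph.conf
[global]
     auth_client_required = none
     auth_cluster_required = none
     auth_service_required = none
     cluster_network = 10.0.0.0/24
     public_network = 10.0.0.0/24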

Step 3. (repeat on every machine that uses Ceph)
Create a monitor and a manager:
#pveceph create mon
#pveceph create mgr
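Before moving on, a quick status check should show the monitors in quorum and an active manager:
#ceph -s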

Step 4. (repeat on every machine providing OSDs)
Create a zvol on the default rpool for the Ceph OSD to use:
#zfs create -V 100G rpool/ceph-osd

Zap the device and create the OSD:
#ceph-volume lvm zap /dev/zd0
#ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class HDD
#ceph-volume raw activate --device /dev/zd0 --no-tmpfs --no-systemd

Start the OSD and enable it to start after reboot:
#systemctl start ceph-osd@0
#systemctl enable ceph-osd@0
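To check that the OSD came up and joined the cluster (the OSD id may not be 0 if OSDs already existed - adjust accordingly):
#ceph osd tree
#systemctl status ceph-osd@0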

Fix the disk permissions:
#echo 'KERNEL=="zd0", OWNER="ceph", GROUP="ceph"' >> /etc/udev/rules.d/99-perm.rules
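The rule only takes effect on the next add/change event, so reload udev, re-trigger the device and check that /dev/zd0 is now owned by ceph:ceph:
#udevadm control --reload-rules
#udevadm trigger /dev/zd0
#ls -l /dev/zd0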
 
This is interesting indeed! Do you by any chance have any performance data?
 
If you're interested, I can gather some performance data. :)

However, I didn't really care about that; I use HDDs connected over a 1 Gbit network.
Low performance, but it does its job perfectly!

HDD per host: 187.45 MB/s
LAN max: 100 MB/s

You can't expect much, but it works perfectly.
 
I am adding this to my system right now to try out Ceph performance with ZFS as the underlying system and an enterprise SSD providing the special/SLOG devices and cache, plus the ZFS ARC, to see if it improves Ceph performance. I was wondering if you can see an improvement there.
 
A ZFS ZIL/SLOG can offer a good performance boost over plain HDDs. But note that Ceph does end-to-end replication and synchronization - that is the write bottleneck. The ARC does well too, just not with larger files, and it doesn't really matter once you hit the speed of the network: 10 Gbit/s is about 1 GB/s, which a regular HDD array already handles, so adding more SSDs will not improve the maximum network transfer.
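If you want hard numbers rather than a feeling, the built-in Ceph benchmark is enough to see where that ceiling sits. A rough sketch, using a throwaway pool called bench (pool deletion has to be allowed via mon_allow_pool_delete):

Code:
ceph osd pool create bench 32
rados bench -p bench 30 write --no-cleanup
rados bench -p bench 30 seq
rados -p bench cleanup
ceph osd pool delete bench bench --yes-i-really-really-mean-it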
 
Updated the guide a bit:


Code:
Ceph on ZFS
Rationale:
Wanting to get the most out of my Samsung PM983 enterprise NVMe drives, and more speed out of Ceph,
I wanted to test Ceph on top of a non-raidz ZFS pool to make use of the ARC, SLOG and L2ARC

Prerequisites:
Proxmox (or Debian)
Working ceph installation (MON, MGR)

/dev/sda = Spinning rust 12TB
/dev/sdb = Spinning rust 4TB
/dev/nvme0n1 = Samsung PM983 PLP 1TB



# Prepare the NVMe: special (metadata) 33 GB, optional SLOG 10 GB (not used when sync is disabled below!) and cache
parted /dev/nvme0n1
mklabel gpt
mkpart primary 0 33GB
mkpart primary 33GB 43GB # OPTIONAL SLOG 10 GB - skip this line if you do not want a SLOG partition
mkpart primary 43GB 256GB # cache - use 33GB 256GB here instead if you skipped the SLOG partition

# Clean the spinning disks
sgdisk --zap-all /dev/sda
sgdisk --zap-all /dev/sdb

# Clean any remaining ZFSes
zpool list
zpool destroy TestZFS

# Create a single-disk ZFS pool named ZFS-ceph in the GUI (using /dev/sda)

# Add more disks in cmdline - advantage of non-raidz is that sizes do not have to match on the drives.
# But then you have no redundancy (we leave that to ceph!)
zpool add ZFS-ceph /dev/sdb

zpool add ZFS-ceph special /dev/nvme0n1p1
zpool add ZFS-ceph log /dev/nvme0n1p2 # OPTIONAL SLOG (not used when sync is disabled below!)
zpool add ZFS-ceph cache /dev/nvme0n1p3 # use /dev/nvme0n1p2 here if you skipped the SLOG partition
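# (sketch) check the resulting vdev layout - special, log and cache should show up as separate vdevs
zpool status ZFS-ceph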

zfs set sync=disabled ZFS-ceph
zfs set atime=off ZFS-ceph
zfs set checksum=off ZFS-ceph # Leave the checksumming/scrubbing to ceph
zfs set xattr=sa ZFS-ceph

# Create zvol chunks from the pool. You should now get /dev/zd0, /dev/zd16 etc.
# They are all 2T to enable uniform distribution in ceph, but this might change, as the
# ceph OSDs eat about 4 GB of RAM per device!
zfs create -V 2T ZFS-ceph/ceph-osd1 && ls /dev/zd* && zfs list
zfs create -V 2T ZFS-ceph/ceph-osd2 && ls /dev/zd* && zfs list
...


# ZAP and Create the ceph OSDs on top of the zd* devices. MAKE NOTE of the assigned OSD nr as
# you will need those numbers to start them later
ls /dev/zd*  && ceph-volume lvm zap /dev/zd0
ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class hdd && ls /dev/zd* 
ceph-volume raw activate --device /dev/zd0 --no-tmpfs --no-systemd && ls /dev/zd* 

# Start the ceph OSD and add to start after reboot
systemctl start ceph-osd@0
systemctl enable ceph-osd@0

# Run once - tell udev to run ceph-volume activate every time /dev/zd* comes online, otherwise the ceph OSDs
# will not start again after a reboot of the host due to /dev/zd* shifting UUIDs.
# Also ensure that zd* is owned by the ceph user.
echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", RUN+="/usr/sbin/ceph-volume raw activate --device /dev/%k --no-tmpfs --no-systemd"' >> /etc/udev/rules.d/99-perm.rules
echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", OWNER="ceph", GROUP="ceph"' >> /etc/udev/rules.d/99-perm.rules
# Run once - turn off the write cache on all disks
echo 'ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", RUN+="/usr/sbin/smartctl -s wcache,off /dev/%k"' >> /etc/udev/rules.d/99-perm.rules


# DONE!
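# Optional sanity checks once everything is up: the OSDs should be up/in under their host
# with device class hdd, and the zvols should live inside the ZFS-ceph pool
ceph osd tree
zpool list -v ZFS-ceph
zfs list -t volume -r ZFS-ceph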
 

How many Proxmox nodes do you have in the cluster doing this?
If you don't have 3+, then how does this give you redundancy?
 
Hi there! I know this post is somewhat old, but I'm running into trouble.
I have three servers, each with two HDDs in RAID 1.
I want to create a Ceph OSD with a zvol on the main raid.
Creating rpool/ceph-osd seems fine, but then I apparently can't create the OSD:
Code:
root@node1:~# ceph-volume lvm zap /dev/zd0
--> Zapping: /dev/zd0
--> --destroy was not specified, but zapping a whole device will remove the partition table
Running command: /usr/bin/dd if=/dev/zero of=/dev/zd0 bs=1M count=10 conv=fsync
 stderr: 10+0 records in
10+0 records out
 stderr: 10485760 bytes (10 MB, 10 MiB) copied, 1.26419 s, 8.3 MB/s
--> Zapping successful for: <Raw Device: /dev/zd0>
root@node1:~# ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class HDD
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new fa0829a5-c388-4afd-933b-345503b56353
 stderr: 2024-08-22T12:08:51.022+0200 7599a5c006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.022+0200 7599a5c006c0 -1 AuthRegistry(0x7599a0063ec8) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 AuthRegistry(0x7599a0063ec8) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 AuthRegistry(0x7599a0069020) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
 stderr: 2024-08-22T12:08:51.026+0200 7599a5c006c0 -1 AuthRegistry(0x7599a5bff3c0) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: [errno 2] RADOS object not found (error connecting to the cluster)
-->  RuntimeError: Unable to create a new OSD id

FYI: I'm using the web GUI to create and configure my Ceph, I don't know if that changes anything. My three machines can be seen under Monitor and Manager.

Thanks !
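
Note for anyone else hitting this trace: ceph-volume is complaining that it cannot find a bootstrap-osd keyring, falls back to no auth, and then cannot reach the cluster at all. No guarantee, but the first thing worth checking - assuming the admin keyring under /etc/pve/priv is intact - is exporting the bootstrap-osd key to the path ceph-volume expects:

Code:
mkdir -p /var/lib/ceph/bootstrap-osd
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring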
 
