[Solution] CEPH on ZFS

PanWaclaw

Hi, I had many problems installing Ceph OSD on ZFS.

Here is a complete solution to resolve them:

Step 1. (repeat on all machines)
Install Ceph - #pveceph install

Step 2. (run only on the main machine of the cluster)
Init Ceph - #pveceph init --network 10.0.0.0/24 -disable_cephx 1
10.0.0.0/24 - your local network
-disable_cephx 1 - disables authentication, which avoids many problems
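
You can sanity-check the result in the generated config. A minimal check (illustrative - the exact option names can differ between Ceph releases):

Code:
grep -E 'network|auth' /etc/pve/ceph.conf
# should show your 10.0.0.0/24 network and, with cephx disabled, the auth settings set to none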

Step 3. (repeat on all machines using Ceph)
Create Mon and Mgr
#pveceph create mon
#pveceph create mgr
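
To confirm the monitors and managers came up, a quick status check helps (standard commands):

Code:
pveceph status
ceph -s   # should list your mons and an active mgr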

Step 4. (repeat on all machines providing OSDs)
Create a simple zvol on the default rpool for the Ceph OSD to use.
#zfs create -V 100G rpool/ceph-osd
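
The zvol shows up as a /dev/zd* block device; which number it gets depends on creation order, so check before zapping (sketch using standard ZFS tooling):

Code:
zfs list -t volume
ls -l /dev/zvol/rpool/ceph-osd   # symlink to the actual /dev/zdX device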

ZAP and CREATE OSD
#ceph-volume lvm zap /dev/zd0
#ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class HDD
#ceph-volume raw activate --device /dev/zd0 --no-tmpfs --no-systemd
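
ceph-volume prints the OSD id it creates; if you missed it, it can be looked up again - you need that number for the systemctl units in the next step (standard commands):

Code:
ceph-volume raw list
ceph osd tree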

Start OSD and add to start after reboot:
#systemctl start ceph-osd@0
#systemctl enable ceph-osd@0

Fix disk permissions
#echo 'KERNEL=="zd0", OWNER="ceph", GROUP="ceph"' >> /etc/udev/rules.d/99-perm.rules
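
To apply the rule without rebooting, reload udev and re-trigger the devices (standard udevadm usage):

Code:
udevadm control --reload-rules
udevadm trigger
ls -l /dev/zd0   # should now show ceph:ceph ownership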
 
This is interesting indeed! Do you by any chance have any performance data?
 
This is interesting indeed! Do you by any chance have any performance data?
If you are interested, I can gather some performance data :)

However, I didn't care much about that; I use HDD drives connected over a 1 Gbit network.
Low performance - but it does its job perfectly!

HDD per host: 187.45 MB/s
LAN max: 100 MB/s

You can't expect much, but it works perfectly.
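
If anyone wants to reproduce numbers like these, the stock Ceph tools are the simplest way. A minimal sketch, assuming a throwaway pool named testpool (rados bench and the pool commands are standard Ceph CLI):

Code:
ceph osd pool create testpool 32
rados bench -p testpool 30 write --no-cleanup   # 30 second write benchmark
rados bench -p testpool 30 seq                  # sequential read benchmark
rados -p testpool cleanup
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it   # requires mon_allow_pool_delete=true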
 
I am adding this to my system right now to try out Ceph performance with ZFS as the underlying system and an enterprise SSD for the special/SLOG and cache devices (plus the ZFS ARC), to see if it improves Ceph performance. I was wondering if you can see an improvement there.
 
A ZFS ZIL/ARC can offer a higher level of performance than bare HDDs. Note that Ceph does end-to-end replication and synchronization - that is the write bottleneck. The ARC does well, but not with larger files, and it doesn't really matter compared to the speed of the network: 10 Gbit/s = 1 GB/s, which a regular HDD array handles fine, so using more SSDs will not improve the maximum network transfer.
 
Updated the guide a bit:


Code:
Ceph on ZFS
Rationale:
Wanting to get the most out of my Samsung PM983 enterprise NVMe drives, and more speed in ceph, I wanted to
test ceph on top of a non-raidz ZFS pool to make use of the ARC, SLOG and L2ARC

Prerequisites:
Proxmox (or Debian)
Working ceph installation (MON, MGR)

/dev/sda = Spinning rust 12TB
/dev/sdb = Spinning rust 4TB
/dev/nvme01n1 = Samsung PM983 PLP 1TB



# Prepare the NVMe: special (metadata) 33GB, optional SLOG 10GB (not used because sync is disabled below!) and cache
parted /dev/nvme0n1
mklabel gpt
mkpart primary 0 33GB
- mkpart primary 33GB 43GB # OPTIONAL SLOG 10GB (not used because sync is disabled below!) - if you create it, start the cache partition at 43GB
mkpart primary 33GB 256GB
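
# (After quitting parted) double-check the resulting partition layout:
parted /dev/nvme0n1 print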

# Clean the spinning disks
sgdisk --zap-all /dev/sda
sgdisk --zap-all /dev/sdb

# Clean any remaining ZFSes
zpool list
zpool destroy TestZFS

# Create a single ZFS pool in the GUI named ZFS-ceph
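# Roughly equivalent on the command line (sketch - assumes the pool starts on /dev/sda):
# zpool create ZFS-ceph /dev/sda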

# Add more disks in cmdline - advantage of non-raidz is that sizes do not have to match on the drives.
# But then you have no redundancy (we leave that to ceph!)
zpool add ZFS-ceph /dev/sdb

zpool add ZFS-ceph special /dev/nvme0n1p1
zpool add ZFS-ceph log /dev/nvme0n1p2 # OPTIONAL SLOG (not used because sync is disabled below!)
zpool add ZFS-ceph cache /dev/nvme0n1p2 # use /dev/nvme0n1p3 here if you created the optional SLOG partition

zfs set sync=disabled ZFS-ceph
zfs set atime=off ZFS-ceph
zfs set checksum=off ZFS-ceph # Leave the checksumming/scrubbing to ceph
zfs set xattr=sa ZFS-ceph
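
# Sanity-check the pool properties:
zfs get sync,atime,checksum,xattr ZFS-ceph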

# Create volume devices (zvols) from the pool. You should now get /dev/zd0, /dev/zd16 etc.
# They are all 2T to enable uniform distribution in ceph, but this might change as the
# ceph OSDs consume roughly 4GB of RAM per device!
zfs create -V 2T ZFS-ceph/ceph-osd1 && ls /dev/zd* && zfs list
zfs create -V 2T ZFS-ceph/ceph-osd2 && ls /dev/zd* && zfs list
...


# ZAP and Create the ceph OSDs on top of the zd* devices. MAKE NOTE of the assigned OSD nr as
# you will need those numbers to start them later
ls /dev/zd*  && ceph-volume lvm zap /dev/zd0
ceph-volume raw prepare --data /dev/zd0 --bluestore --no-tmpfs --crush-device-class hdd && ls /dev/zd* 
ceph-volume raw activate --device /dev/zd0 --no-tmpfs --no-systemd && ls /dev/zd* 
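
# If you missed the assigned OSD ids in the output above, they can be looked up again
# (ceph-volume raw list shows the prepared devices, ceph osd tree the ids):
ceph-volume raw list
ceph osd tree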

# Start the ceph OSD and add to start after reboot
systemctl start ceph-osd@0
systemctl enable ceph-osd@0
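
# Check that the service is running and the OSD is reported up/in:
systemctl status ceph-osd@0 --no-pager
ceph osd stat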

# Run Once - Tell udev to re-run ceph-volume activate every time a /dev/zd* device comes online,
# otherwise the ceph OSDs will not start again after a reboot of the host due to /dev/zd shifting UUIDs.
# Also ensure that the zd* devices are owned by the ceph user.
echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", RUN+="/usr/sbin/ceph-volume raw activate --device /dev/%k --no-tmpfs --no-systemd"' >> /etc/udev/rules.d/99-perm.rules
echo 'KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", OWNER="ceph", GROUP="ceph"' >> /etc/udev/rules.d/99-perm.rules
# Run Once - Turn off the write cache on all disks
echo 'ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", RUN+="/usr/sbin/smartctl -s wcache,off /dev/%k"' >> /etc/udev/rules.d/99-perm.rules
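
# Reload the udev rules so they take effect without a reboot:
udevadm control --reload-rules
udevadm trigger
# Optionally confirm the write cache is really off (smartctl --get needs a recent smartmontools):
smartctl -g wcache /dev/sda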


# DONE!
 
