Ceph Single Node vs ZFS

zeuxprox

Hi,

a client of mine currently has only one server with the following characteristics:

  • Server: SUPERMICRO Server AS -1113S-WN10RT
  • CPU: 1 x AMD EPYC 7502 32C
  • RAM: 512 GB ECC REC
  • NIC 1: 4 x 10Gb SFP+ Intel XL710-AM1 (AOC-STG-I4S)
  • NIC 2: 4 x 10Gb SFP+ Intel XL710-AM1 (AOC-STG-I4S)
  • SSD for OS: 2 x Samsung PM983 960 GB NVMe, in RAID 1
  • SSD for VMs: 5 x Micron 9300 MAX 6.4 TB NVMe
I will install Proxmox on the 2 Samsung SSDs in a ZFS RAID 1 (mirror), but for the VMs, which is the better choice between the following two, considering performance, scalability and data security:
  1. Ceph on a single node;
  2. ZFS RAIDZ2?
Consider that I want to be able to lose two disks simultaneously and continue working. Consider also that I will use Proxmox Backup Server to execute VMs incremental backups.
If the choice falls on the first solution (Ceph on a single node), what should the configuration look like?

In reading and writing what kind of performance can I expect?

Thank you
 
In reading and writing what kind of performance can I expect?

As @ermanishchawla already described, Ceph is out, so you're "stuck" with ZFS. The performance of RAIDZ2 is very, very slow and you will waste a lot of space with such a setup (please search the forums), but if your client asks to be able to lose any two disks, you're stuck with it. I'd recommend using striped mirrors (RAID10) and mixing different brands of SSDs: if one series has a problem, all SSDs of that series will fail simultaneously, because they're exposed to the same write pattern.
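For reference, a striped-mirror (RAID10) pool over four of the NVMe drives could be created roughly like this (a sketch only; the pool name "tank" and the device paths are placeholders, and /dev/disk/by-id/ paths are generally safer):
Code:
# two mirrored vdevs striped together (RAID10) from four NVMe drives
zpool create -o ashift=12 tank \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    mirror /dev/nvme4n1 /dev/nvme5n1
# optionally keep the fifth drive as a hot spare
zpool add tank spare /dev/nvme6n1

ashift=12 assumes 4K physical sectors; check what your drives actually report.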
 
Hi,

Yes, I know that Ceph needs at least 3 nodes, and for a standalone server ZFS is the "natural" choice. My question arose because in the near future the customer will add two more servers, and with Ceph I would have had the storage ready; it would have been enough to add nodes to Ceph ...

Speaking of ZFS, with 5 NVMe disks, what type of RAIDZ level do you recommend, considering that data security is fundamental?

Thank you
 
[...]

Speaking of ZFS, with 5 NVMe disks, what type of RAIDZ level do you recommend, considering that data security is fundamental?

Thank you
This might help you gauge the performance impact: Post #15 and Post #16
https://forum.proxmox.com/threads/zfs-configuration-planning-for-virtual-machines.73445/post-328913

I am in the same boat (3-node cluster with 6-8 NVMe SSDs) and I am going to go with ZFS-based RAID10 and storage replication every 1 minute. Don't expect tests before October/November, though.
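For what it's worth, a 1-minute replication schedule can be configured roughly like this on PVE (a sketch; VM ID 100 and target node pve2 are placeholders):
Code:
# replicate VM 100's disks to node pve2 every minute
pvesr create-local-job 100-0 pve2 --schedule "*/1"
# list the configured replication jobs and their current state
pvesr list
pvesr status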
 
Ceph will absolutely run on a single node; it's just not normally a practical option, as you're limiting yourself to a single host for redundancy, so there are generally better-suited options for single-node applications. You would need to make sure your CRUSH map splits PGs over OSDs instead of hosts (the default), but other than that it's straightforward to set up as a single node, and with 5 OSDs you wouldn't even have to touch size and min_size.
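One possible way to do that (sketched here as an assumption, not taken from this thread; the rule and pool names are placeholders) is to create a replicated rule whose failure domain is "osd" and assign it to the pool:
Code:
# CRUSH rule that places replicas across OSDs instead of hosts
ceph osd crush rule create-replicated replicated_osd default osd
# point an existing pool at the new rule
ceph osd pool set mypool crush_rule replicated_osd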

I wouldn't really weigh the performance or data security differences between them. ZFS will perform better, and data security will be fine on either one if you just mean bit rot eating your data. One concern is that if you don't understand how Ceph works, it can be easier to metaphorically shoot yourself in the foot than with ZFS as far as data loss goes. The real consideration here is whether you will have a use for Ceph once the other two servers get added. You're kind of comparing apples and oranges: Ceph is a clustered storage system, whereas ZFS is just local. You can replicate VMs between hosts on ZFS, but that comes with its own downsides, like no live migration, whereas with Ceph you can live migrate, and writes aren't committed until they're written to multiple separate hosts. With Ceph, if you lose a host and have HA set up, it can relaunch that VM on a different host and it'll be as if you just did a hard reset of the VM, whereas with ZFS and storage replication your VM might be a little out of date depending on how frequently you replicate. For some applications, like databases, it may not be acceptable to accept a transaction, "commit" it, and then have the DB appear to go back in time and unwind those supposedly durable changes.

The considerations around clustered vs local storage are a much more significant concern than raw performance and scalability, IMHO. If you want Ceph later on once you have 3 nodes, I'd go with Ceph from the start rather than starting on ZFS and migrating to Ceph later.
 
Hi,

I opted to use ZFS. Now I need to decide whether to use RAIDZ2 or RAID10 with the fifth disk as a spare. I have currently tested RAIDZ2 with:
Code:
fio --name=randwrite --output /NVME-1-VM --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=16384M --numjobs=4 --runtime=240 --group_reporting

and the result of fio is:
Code:
Jobs: 4 (f=4): [w(4)][100.0%][w=199MiB/s][w=50.9k IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=22274: Wed Aug 12 09:16:29 2020
  write: IOPS=50.7k, BW=198MiB/s (208MB/s)(46.4GiB/240005msec); 0 zone resets
    slat (usec): min=3, max=25574, avg=77.46, stdev=338.90
    clat (nsec): min=240, max=11324k, avg=495.24, stdev=16316.83
     lat (usec): min=4, max=25579, avg=78.11, stdev=340.33
    clat percentiles (nsec):
     |  1.00th=[  270],  5.00th=[  290], 10.00th=[  302], 20.00th=[  310],
     | 30.00th=[  322], 40.00th=[  322], 50.00th=[  342], 60.00th=[  370],
     | 70.00th=[  410], 80.00th=[  450], 90.00th=[  502], 95.00th=[  548],
     | 99.00th=[ 1176], 99.50th=[ 2256], 99.90th=[ 4768], 99.95th=[ 5280],
     | 99.99th=[73216]
   bw (  KiB/s): min=38088, max=58848, per=24.99%, avg=50667.76, stdev=2246.52, samples=1920
   iops        : min= 9522, max=14712, avg=12666.90, stdev=561.65, samples=1920
  lat (nsec)   : 250=0.01%, 500=89.79%, 750=8.36%, 1000=0.65%
  lat (usec)   : 2=0.61%, 4=0.34%, 10=0.22%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=1.57%, sys=61.93%, ctx=724056, majf=7, minf=58
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12165608,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=198MiB/s (208MB/s), 198MiB/s-198MiB/s (208MB/s-208MB/s), io=46.4GiB (49.8GB), run=240005-240005msec


while the one reported by zpool iostat NVME-1-VM 2 is:
Code:
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
NVME-1-VM   23.5G  29.1T      0   122K      0  2.49G
NVME-1-VM   23.5G  29.1T      0   121K      0  2.50G
NVME-1-VM   23.6G  29.1T      0   118K      0  2.49G


I have also used iostat -hxdm 2:
Code:
 r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util Device
 0.00 23418.50      0.0k    502.3M     0.00     0.00   0.0%   0.0%    0.00    0.03   0.00     0.0k    22.0k   0.03  75.0% nvme6n1
 0.00 23494.00      0.0k    503.5M     0.00     0.00   0.0%   0.0%    0.00    0.03   0.00     0.0k    21.9k   0.03  74.8% nvme5n1
 0.00 24029.50      0.0k    503.2M     0.00     0.00   0.0%   0.0%    0.00    0.03   0.00     0.0k    21.4k   0.03  74.4% nvme4n1
 0.00 24123.00      0.0k    503.3M     0.00     0.00   0.0%   0.0%    0.00    0.03   0.00     0.0k    21.4k   0.03  74.6% nvme2n1
 0.00 24160.00      0.0k    503.5M     0.00     0.00   0.0%   0.0%    0.00    0.03   0.00     0.0k    21.3k   0.03  74.6% nvme3n1


How do you judge these performances? And what improvements could I expect from using RAID10?

Thank you
 
Hi; I am currently on vacation and just catching up on some threads ...

Have you read these posts?
https://forum.proxmox.com/threads/zfs-configuration-planning-for-virtual-machines.73445/post-328913
https://forum.proxmox.com/threads/zfs-configuration-planning-for-virtual-machines.73445/post-328941

There is a reference to a book called:
Lucas & Jude, FreeBSD Mastery: ZFS, page 42 (in my copy it was page 72)

It goes in depth on the characteristics of a ZFS filesystem, especially RAID10 vs RAIDZ1/2 from a performance point of view. I can only recommend it; a good read.
 
Jobs: 4 (f=4): [w(4)][100.0%][w=199MiB/s][w=50.9k IOPS][eta 00m:00s]
That's also about what a single disk is able to do (no tuning). Test with a RAID10; you should see an increase in IOPS.
 
Hi,

I ran a test with RAID10 + one spare disk, but I'm a little confused by the results...

I ran the test with:

Code:
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=16384M --numjobs=4 --runtime=240 --group_reporting

and the result of fio is:
Code:
Jobs: 4 (f=4): [w(4)][100.0%][w=234MiB/s][w=59.9k IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=15411: Wed Aug 26 12:24:50 2020
  write: IOPS=60.5k, BW=236MiB/s (248MB/s)(55.4GiB/240001msec); 0 zone resets
    slat (usec): min=3, max=40619, avg=64.85, stdev=309.03
    clat (nsec): min=240, max=22155k, avg=447.20, stdev=17431.89
     lat (usec): min=4, max=40619, avg=65.43, stdev=310.51
    clat percentiles (nsec):
     |  1.00th=[  282],  5.00th=[  290], 10.00th=[  302], 20.00th=[  310],
     | 30.00th=[  322], 40.00th=[  322], 50.00th=[  330], 60.00th=[  342],
     | 70.00th=[  350], 80.00th=[  402], 90.00th=[  482], 95.00th=[  540],
     | 99.00th=[  892], 99.50th=[ 1784], 99.90th=[ 4960], 99.95th=[ 5536],
     | 99.99th=[12864]
   bw (  KiB/s): min=41192, max=65736, per=24.99%, avg=60510.20, stdev=2051.12, samples=1916
   iops        : min=10298, max=16434, avg=15127.51, stdev=512.78, samples=1916
  lat (nsec)   : 250=0.01%, 500=91.65%, 750=6.98%, 1000=0.49%
  lat (usec)   : 2=0.42%, 4=0.22%, 10=0.22%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.75%, sys=68.69%, ctx=552428, majf=0, minf=43
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,14527678,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=236MiB/s (248MB/s), 236MiB/s-236MiB/s (248MB/s-248MB/s), io=55.4GiB (59.5GB), run=240001-240001msec

while the one reported by zpool iostat 2 is:
Code:
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
NVME-VM-1   13.6G  11.6T      0  44.6K      0  3.38G
NVME-VM-1   13.7G  11.6T      0  45.5K      0  3.47G
NVME-VM-1   13.7G  11.6T      0  45.2K      0  3.41G
NVME-VM-1   13.8G  11.6T      0  45.0K      0  3.38G
NVME-VM-1   13.9G  11.6T      0  46.4K      0  3.46G
NVME-VM-1   13.9G  11.6T      0  31.9K      0  2.43G
NVME-VM-1   13.9G  11.6T      0      0      0      0

iostat -hxdm 2 reported:
Code:
 r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util Device
 0.00 11837.00      0.0k    892.2M     0.00     0.00   0.0%   0.0%    0.00    0.09   0.00     0.0k    77.2k   0.06  65.8% nvme5n1
 0.00 11853.50      0.0k    892.1M     0.00     0.00   0.0%   0.0%    0.00    0.09   0.00     0.0k    77.1k   0.06  65.8% nvme4n1
 0.00 11600.50      0.0k    899.6M     0.00     0.00   0.0%   0.0%    0.00    0.09   0.00     0.0k    79.4k   0.06  65.6% nvme2n1
 0.00 11923.00      0.0k    899.4M     0.00     0.00   0.0%   0.0%    0.00    0.09   0.00     0.0k    77.2k   0.06  65.6% nvme3n1

I would have expected a performance boost, especially for IOPS, and the fio results confirm this, but zpool iostat and iostat do not.
The results of zpool iostat 2 are (average):
Code:
RAID10:  45.5K write ops/s
RAIDZ2: 121.0K write ops/s

while iostat -hxdm 2 reported (average):

Code:
RAID10: 11853.50 w/s
RAIDZ2: 24029.50 w/s


The question is: why do zpool iostat and iostat give these results?

Thank you
 
First off, direct and sync should be enabled, otherwise you will be testing caches as well. And I would start with sequential write/read benchmarks, as random write/read will in most cases be slower.
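For example, the earlier run could be repeated along these lines (a sketch based on the command posted above, with caches bypassed and a sequential pass first; note that depending on the ZFS version, O_DIRECT may be accepted but still served through the ARC):
Code:
# sequential write with O_DIRECT + O_SYNC to take caching out of the picture
fio --name=seqwrite --ioengine=libaio --iodepth=1 --rw=write --bs=4k --direct=1 --sync=1 --size=16384M --numjobs=4 --runtime=240 --group_reporting
# then the random-write case with the same flags
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=1 --sync=1 --size=16384M --numjobs=4 --runtime=240 --group_reporting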

Then, the different RAID modes have different write patterns, especially when the RAID10 is a three-way mirror, where writes are split and then written onto two of the mirrors.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_zfs
 
Ceph cannot run on a single node; it requires a minimum of 3 nodes. So Ceph is out.

zeuxprox, you can do Ceph on a single node; you need to tune Ceph to replicate not per host but per HDD/OSD.

You can follow it in detail here: https://linoxide.com/hwto-configure-single-node-ceph-cluster/

The quick details are below; this is what we did for our DR location, which was a single node, while the primary was 5 nodes. We then wrote a script to do async Ceph snapshot export/import to push data to our DR server for safety: https://github.com/deependhulla/ceph-dr-sync-tool-for-proxmox

Quick Setup:
Code:
cd /root/
ceph osd getcrushmap -o crush_map_compressed
crushtool -d crush_map_compressed -o crush_map_decompressed
cp crush_map_decompressed crush_map_decompressed_backup_original
# edit the file crush_map_decompressed (in vi or mcedit)
# Search for:    step chooseleaf firstn 0 type host
# Change it to:  step chooseleaf firstn 0 type osd
# save the file: crush_map_decompressed
crushtool -c crush_map_decompressed -o new_crush_map_compressed
ceph osd setcrushmap -i new_crush_map_compressed
## now check ceph -s : it should show an active+clean state
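For orientation, the rule section in the decompiled map typically looks something like this after the edit (an illustrative excerpt; rule name and id may differ on your cluster):
Code:
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd
        step emit
}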

Hope this guide helps to test out a single-node Ceph setup.
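The snapshot export/import mentioned above usually boils down to something like the following (a rough sketch, not the linked script itself; the pool, image, snapshot and host names are placeholders, and the previous snapshot must already exist on both sides):
Code:
# take a fresh snapshot and ship only the delta since the previous one to the DR node
rbd snap create rbd/vm-100-disk-0@today
rbd export-diff --from-snap yesterday rbd/vm-100-disk-0@today - \
    | ssh dr-node rbd import-diff - rbd/vm-100-disk-0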
 


If the purpose is only testing, Ceph can be tuned to run on a single node, but it will not get you anything. Ceph's performance characteristics come from network latency and OSD slowness, so with a single-node setup where replication runs across OSDs on the same server, you will not see anything representative of what you would get in production.

Hence, again I would say: use a minimum 3-node setup.
 
The best use of Ceph is multi-node, and we too highly recommend that for the primary, but in our case we wanted an additional backup at a DR location, on a single server with 3 HDDs, using Ceph snapshot export/import to push data to the remote location very frequently.

On the practical front we have not run into any slowness, as no network is involved since it's all on the same host.
It has been running our DR for a year now, so in practice, yes, the Ceph CRUSH rule "step chooseleaf firstn 0 type osd" has given us a stable result.
Ceph is a great piece of software.
 


Yes, Ceph is indeed great; the new version integrated with Proxmox (Octopus) does have better performance and better features.
I am using it with bucket levels rack and pod. The performance is phenomenal.
 