Migrate ZFS to Ceph?

Springtime

Hello,
would it be possible to migrate from replicated ZFS to Ceph, with 3 nodes in a cluster, without disturbing production?
Note: I have never used Ceph and don't actually know anything about it yet. All storage is internal: 3 servers, each with an HBA and 8 flash drives.
Thanks
 
Your goal would be to reuse the same hardware including disks, migrate from ZFS to Ceph, live on a production setup without disturbing anything? *Maybe* theoretically possible with some serious planning (and risks) but my recommendation would be to first read up on Ceph, understand how you would like to set it up, and then plan for downtime while you transition. You would likely need some intermediate storage while you rebuild the storage as a Ceph cluster and then migrate your data back again.

I did a similar thing *but* my Ceph storage was on separate hardware (disks) so I could run both in parallel during transition.
 
You would likely need some intermediate storage while you rebuild the storage as a Ceph cluster and then migrate your data back again.
I was more or less expecting this answer. Nevertheless, never hurts to ask.
There is no intermediate storage available, since the file server alone is larger than the local disks that Proxmox resides on (which I could in theory use as transition storage).
So I guess the cleanest option would be to back up the VMs, remove ZFS, create Ceph and then restore from backup. And do that on a weekend.
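If I go that route, I guess the flow would be roughly the following (VMID and storage names are just placeholders; snapshot mode should let the VMs keep running during the backup itself):

Code:
# 1) Back up each VM to some external/backup storage
vzdump 101 --storage backup-nfs --mode snapshot

# 2) After removing ZFS and creating the Ceph pool, restore onto it
#    (the actual archive path/name will of course differ)
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-101-....vma.zst 101 --storage ceph-vm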
 
Yes, if you have ZFS and Ceph on the same machines, just live-migrate the storage from one to the other. That is it.
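On current PVE versions that is the "Move disk"/"Move Storage" action on the VM's Hardware tab, or roughly this on the CLI (VMID, disk and storage names are placeholders):

Code:
# Move a running VM's disk from the ZFS storage to the Ceph RBD storage
# and delete the old copy when done
qm move_disk 101 scsi0 ceph-vm --delete 1
# (newer PVE versions also accept: qm disk move 101 scsi0 ceph-vm --delete 1)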
 
So I guess the cleanest option would be to back up the VMs, remove ZFS, create Ceph and then restore from backup. And do that on a weekend.
Yes. Unless you have enough hardware to run both Ceph and ZFS in parallel and migrate from one to the other.
 
to Ceph, with 3 nodes in a cluster
You want to enhance your current situation, right?

Are you sure you want Ceph with only three nodes? A reliable system with features like self-healing has some pitfalls, besides the (usually) much lower performance compared to local storage.

As a ZFS fanboy I need to mention this one too:
 
From what I have read up until now, I believe the answer is:
Are you sure you want Ceph with only three nodes?
Yes.
Why:
The usage of only 1/3 of the total storage is fine, at least here in the office. We don't have that much data, not even 1 TB, and currently, with only 4 disks used per node, I have 3.6 TB. This will go up once I add the additional disks - we have 3 nodes, each with 8 SSDs * 960 GB, all original enterprise HPE SSDs, the same type in all 3 nodes.
And about space: with ZFS, I used replication to the other two nodes, so any node could fail and another would still pick up, minus whatever changed since the last replication. But that also means I basically only have 1/3 of the space available, since I keep 3 copies of the data. Although, of course... I can choose not to.
The only benefit I see in ZFS compared to Ceph, when it comes to HA, is that with ZFS replication two nodes could actually fail, as long as the remaining node has enough storage/RAM/CPU. Which it does here.
However... when running ZFS I chose RAIDZ2, meaning I already lose some capacity to parity on the node itself, and when I replicate to the other two nodes I lose 2/3 of what remains. So ZFS RAIDZ2 with replication costs me considerably more storage, if I understand it correctly.
With Ceph, only one node can fail while the pool stays R/W, if I understand correctly. That would be enough.
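For what it's worth, the rough capacity math behind those numbers (assuming 960 GB per SSD and ignoring TB-vs-TiB rounding and overhead, which is why the GUI shows slightly lower values):

Code:
# Raw capacity
#   currently:       4 SSDs/node * 3 nodes * 0.96 TB = 11.5 TB raw
#   fully populated: 8 SSDs/node * 3 nodes * 0.96 TB = 23.0 TB raw
#
# Ceph with size=3 (one copy per host): usable ≈ raw / 3
#   currently: 11.5 / 3 ≈ 3.8 TB      fully populated: 23.0 / 3 ≈ 7.7 TB
#
# ZFS RAIDZ2 (8 disks, 2 parity) + replication to the other two nodes:
#   per node: 6/8 * 7.68 TB ≈ 5.76 TB, and every node holds a full copy,
#   so usable ≈ 5.76 TB out of 23 TB raw (~25%, vs. ~33% with Ceph)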

When it comes to OSDs, I still don't completely understand what they do, even though I have read about them. Basically, I created Ceph with 4 disks from each node; 4 SSDs are currently unused. I wanted to test what adding disks to the Ceph cluster is like. As I gather, I need to create one OSD per physical disk.
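If I read the docs correctly, adding one of the unused SSDs later is just a matter of creating an OSD on it, something like this (the device name is a placeholder, and the disk has to be free of old partitions/signatures):

Code:
# Wipe leftover signatures on the spare disk (destructive!), then create the OSD on it
ceph-volume lvm zap /dev/sde --destroy
pveceph osd create /dev/sde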

The current issue I have is checking the network: in all cases I get the full 10G, except when running iperf to Node3, which is a generation older (a DL380 Gen9; the other two are Gen10). But I've created a separate thread about that.
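For reference, the per-link test is roughly this (the IP is a placeholder for Node3's 10G interface; parallel streams help rule out a single-stream limit):

Code:
# on the receiving node
iperf3 -s
# on the sending node
iperf3 -c 10.10.10.13 -P 4 -t 30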

Oh, and let me be clear on one thing: this is more or less a POC environment, to learn about Ceph, because that is the only possibility we have to test it before even thinking about implementing it in our datacenters (we have two of them, each based on Azure Local with 6 nodes).
 
The usual minimum for Ceph is "size=3,min_size=2". This refers to nodes, not OSDs, as the "failure domain" is (of course) "host".

This means each and every data block is stored three times = once on each host. The usable space is obviously one third of all space on all OSDs.

(Actually it is lower: the moment one device fails, Ceph re-balances data ("PG"s) --> using more space on the surviving OSDs than a moment before. The pool then gets "full" earlier than expected. A full pool is bad for both Ceph and ZFS - you really want to avoid that.)
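To check or change this on an existing pool, and to keep an eye on fullness, something like this (the pool name is a placeholder):

Code:
# show / set the replication settings of a pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size
ceph osd pool set vm-pool min_size 2
# watch raw vs. stored usage and the nearfull/full thresholds
ceph df
ceph osd dump | grep -E 'full_ratio|nearfull_ratio'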

When I use ZFS replication with three nodes I get exactly the same 1/3 ratio.

All of my ZFS storages are mirrored vdevs. That halves the net space compared to Ceph - in my specific setup, for the example discussed here.
 
When it comes to OSDs, I still don't completely understand what they do, even though I have read about them. Basically, I created Ceph with 4 disks from each node; 4 SSDs are currently unused. I wanted to test what adding disks to the Ceph cluster is like. As I gather, I need to create one OSD per physical disk.
At a high level, each disk is an OSD, yes. Ceph will store a block of data as configured... the default/normal 3/2 setup stores one copy of the block on each of 3 servers, and Ceph chooses which OSDs to use. If only 2 copies are left (an OSD fails, a server fails), it creates another copy. If fewer than 2 copies are left, data for the VM can't be written.

Pros include: the storage is available to all physical servers (so a VM migration only copies RAM contents), and you can add disks at any time to increase space.
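A few read-only commands make this visible - overall health, which OSD sits on which host (the failure domain), and which OSDs hold the copies of each placement group (the pool name is a placeholder):

Code:
ceph status                          # health, and any degraded/undersized PGs
ceph osd tree                        # OSDs grouped by host
ceph pg ls-by-pool vm-pool | head    # PGs of a pool and their acting set of OSDs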
 
OK, if I understand correctly, in 3/2 a whole host can fail. What happens if additional OSDs (disks) on one of the remaining two nodes also fail? Say, one node is down and additionally 2 SSDs out of 8 fail on one of the remaining nodes? There are 8 OSDs on each server, so 24 in total.

Storage: yes, I am very much aware that one should not fill up the storage, with either Ceph or ZFS. Currently we use around 1 TB, and that won't grow unless our management gets the idea to move other workloads onto this cluster.
But one thing I don't quite understand: with 3/2, only a third of the storage should be available. However, "Usage" at the Datacenter level says 267 GB of 21 TB (1%). I currently have only one VM on it; everything else is still spread across local disks. If I look at the storage directly on one of the nodes, it says 84 GB out of 7.18 TB. So by the sound of it, the Datacenter view is the raw total, and the real free space, due to 3/2, is what is shown under the storage on each node (the same value on every node).
It's good to know that the usage will grow if one node fails. That is one scenario I want to test.
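If I understand the docs correctly, ceph df should confirm that split: the global section is raw capacity/usage across all OSDs, while the per-pool section shows the data actually stored and the (roughly 3x) raw space it consumes, plus MAX AVAIL as the usable free space:

Code:
ceph df detail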

The next logical step for me, now that I have the basics configured, is to explore performance, benchmarking and tweaking.
Can you recommend specific ways of benchmarking and establishing a baseline, and how to recognize what "good numbers" are?
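Apart from rados bench on the pool itself, I assume an fio run inside a test VM would give a baseline closer to the real workload - something like this, with the parameters being just a starting point rather than a tuned benchmark:

Code:
fio --name=randwrite --filename=/root/fio-test.bin --size=4G \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting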

Besides, I find it great to be able to create CephFS, so that my ISOs don't have to be uploaded 3 times. Just a nice central storage.
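If I understand it correctly, CephFS needs at least one metadata server (MDS) before the filesystem itself can be created; on Proxmox that should be roughly the following (name and flags taken from the pveceph docs, so treat this as a sketch):

Code:
pveceph mds create                            # create an MDS on this node
pveceph fs create --name cephfs --add-storage # create the CephFS and add it as storage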
 
Here are some numbers, and as I said, not sure how to judge them.

Code:
:~# rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_s02p00a3101_776180
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       224       208   831.846       832   0.0511275   0.0747983
    2      16       459       443   885.849       940    0.052232   0.0712619
    3      16       688       672   895.871       916    0.063132    0.070411
    4      16       908       892   891.871       880   0.0541496   0.0708804
    5      16      1132      1116   892.664       896   0.0287243   0.0711261
    6      16      1377      1361   907.194       980   0.0527882   0.0700263
    7      16      1623      1607   918.143       984    0.065819   0.0691246
    8      16      1871      1855   927.356       992   0.0560834    0.068565
    9      16      2103      2087   927.417       928    0.080184     0.06875
   10      16      2328      2312   924.666       900   0.0344433   0.0686554
Total time run:         10.0628
Total writes made:      2328
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     925.388
Stddev Bandwidth:       51.0442
Max bandwidth (MB/sec): 992
Min bandwidth (MB/sec): 832
Average IOPS:           231
Stddev IOPS:            12.7611
Max IOPS:               248
Min IOPS:               208
Average Latency(s):     0.0689401
Stddev Latency(s):      0.0517506
Max latency(s):         1.09492
Min latency(s):         0.0226393


:~# rados bench -p testbench 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15       538       523   2090.88      2092   0.0307482   0.0295879
    2      16      1065      1049   2097.26      2104   0.0243273   0.0296936
    3      16      1596      1580   2106.03      2124   0.0433334   0.0296689
    4      16      2156      2140   2139.43      2240    0.063928   0.0291874
Total time run:       4.32856
Total reads made:     2328
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2151.29
Average IOPS:         537
Stddev IOPS:          16.9902
Max IOPS:             560
Min IOPS:             523
Average Latency(s):   0.0290851
Max latency(s):       0.174107
Min latency(s):       0.0134061


:~# rados bench -p testbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15       531       516    2063.1      2064   0.0161054   0.0299842
    2      16      1050      1034   2067.43      2072   0.0141886    0.028746
    3      16      1279      1263   1683.59       916   0.0160117   0.0267895
    4      16      1297      1281   1280.71        72   0.0160864   0.0266305
    5      15      1513      1498   1198.05       868   0.0381363   0.0526714
    6      15      2019      2004    1335.6      2024   0.0155546   0.0461387
    7      15      2251      2236    1277.3       928   0.0172748   0.0432012
    8      16      2280      2264   1131.66       112   0.0155167   0.0428599
    9      16      2448      2432   1080.58       672   0.0166542   0.0585538
   10      15      2975      2960   1183.68      2112   0.0189133   0.0530766
   11       6      2975      2969   1079.36        36   0.0274126   0.0529744
   12       6      2975      2969   989.422         0           -   0.0529744
Total time run:       12.6516
Total reads made:     2975
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   940.592
Average IOPS:         235
Stddev IOPS:          216.918
Max IOPS:             528
Min IOPS:             0
Average Latency(s):   0.0585469
Max latency(s):       3.06513
Min latency(s):       0.00262166
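# Note: the write run above used --no-cleanup, so the benchmark objects are still in
# the pool. Assuming the same pool name, they can be removed afterwards with:
:~# rados -p testbench cleanup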
 
What happens if additional OSDs (disks) on one of the remaining two nodes also fail? Say, one node is down and additionally 2 SSDs out of 8 fail on one of the remaining nodes?
Ceph will try to reallocate "missing" storage blocks. If the other 6 drives can take the data from those two failed SSDs then you're OK.

This is also why it's helpful to have more Ceph nodes, so one host failure (or reboot) doesn't cause a problem.

One can also get into trouble by having, say, 3 nodes with storage distributed unevenly: 10 TB, 5 TB, 5 TB. Ceph can't use all of the 10 TB, and if that node fails it has to move that data to the other two nodes.

I was also looking at "Usage" recently, but didn't get too far into it... it seems the GUI is a bit inconsistent about whether it shows total space allocated or space actually used, but I suspect that's intentional, depending on where one is looking.

You can mark OSDs out and/or down in the GUI.
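Or on the CLI, for example for OSD 5 (the ID is a placeholder):

Code:
ceph osd out osd.5          # take it out of data placement; Ceph re-replicates its PGs
ceph osd in osd.5           # put it back in
systemctl stop ceph-osd@5   # on its node: stop the daemon to simulate a failure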