Storage options for a (3) node cluster with local NVMe disks and 25G network.

Aptadmin

Aug 11, 2025
Hello All,

We purchased (3) boxes with (2) GOLD 6526Y CPUs, 512GB memory, 25G networking on ConnectX-5 cards, and (2) directly attached enterprise 6.4TB NVMe disks.

The machines are VERY fast locally, and the plan was to install Ceph, but what is natively well north of 100k IOPS (4k random writes with cache disabled) turns into 1k IOPS in a VM. We went through the standard Ceph pat-your-head-and-rub-your-belly routine, checking the BIOS performance modes and turning every knob imaginable, and found that the performance isn't consistent enough to trust.

We have a lot of ZFS elsewhere in the network and are very comfortable with it, so I wonder if there is a way to deploy ZFS redundancy using this hardware. Is installing truenas scale in a base VM with elevated priority an option? What about RSF-1?

I'm aware of the ZFS 1-minute replication that many use, and while that sounds like it could work we do have some database and email workloads that I wouldn't want to lose 1 minute on a failure. I suppose we could cluster those services at the application level, but that would require a bit of effort and more cluster solutions to manage.

I'd like to get some insight from others that have slayed these dragons before me so I know which road to go bark up. Thanks!
 
Congratulations on your new environment. Before you can make rational decisions about storage deployment, you need to consider holistically what you want/need out of the equipment.

1. What are you deploying on the equipment?
2. What are your performance floor requirements (compute/IO)?
3. What are your RPO/RTO policies? (how much downtime are you willing to take)

Framing the question as "asking for insight" doesn't help if the solutions suggested don't match your requirements.
 
1. Web server, mail server, database server, docker containers. General purpose stuff that requires decent IO performance.
2. I really want 15K iops or better in a virtual machine. CPU requirements are much less and my cpus are plenty fast enough to do both.
3. I need to survive one hypervisor failure without any data loss or loss of service.

I thought I pointed this out in my original post, which is mostly asking if there are some other storage technologies that work well on this hardware apart from ceph and 1 minute zfs replication given that ceph has terrible latency and 1 minute replication has too much possibility of data loss.
 
1. Any storage back end should serve that stated requirement. Be aware that a performance spec defined as "decent" may mean different things to different people.
2. 15k IOPS is doable. In your environment, LVM-type storage will yield the best performance. As you did not mention any other required features, that would be my advice.
3. So... no data loss is easier than no loss of service. The ONLY way to guarantee no loss of service is to have the service multiheaded, in which case the problem becomes not one of fault tolerance but of back-end data synchronization. Look at the application itself for guidance on how to accomplish that effectively within your performance requirements.

If the intention is to facilitate live migration, PVE can live migrate from any block type to any block type, it will just need to copy the disk over before the final cutover.

If the intention is to survive a node failure with downtime, then yes, you will need Ceph, a different SDS, or a SAN. All of those options work. A longer RPO can also be met with instant restore from backup.

Lastly, since you don't seem to have concrete performance requirements, I'd advise not paying them much attention. Snapshots (and all related features), inline compression, deduplication, participation in your DR/BC policies, etc. become of much greater importance, at least in my book.
 
I'm able to get above 10k write IOPS within a VM on my homelab, with 3 1TB SSDs, a 9700K, and ConnectX-3 cards.

On the Ceph cluster at work it's 200k+ within a VM.

You should rub your head/pat your belly and diagnose a bit more.
 
This is helpful, can you tell me more about your home lab? You have (3) machines with 1TB SSDs, 9700k cpus, and connectx3 cards? How are they connected? 10G ethernet? Did you do anything special with ceph tuning or ip stack tuning? Do you know what IOPS you get directly on your SSDs from the hypervisor itself? Can you confirm which fio command you are running?

Here is benchmarking an rbd device on the hypervisor:

fio --name=random-write --rw=randwrite --bs=4k --numjobs=8 --size=256m --iodepth=64 --runtime=60 --time_based --end_fsync=1 --sync=1 --ioengine=libaio --group_reporting=1 --filename=/dev/rbd0
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
...
fio-3.39
Starting 8 processes
Jobs: 8 (f=8): [w(8)][100.0%][w=26.0MiB/s][w=6663 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=8): err= 0: pid=3829753: Mon May 18 15:32:30 2026
write: IOPS=6582, BW=25.7MiB/s (27.0MB/s)(1543MiB/60002msec); 0 zone resets
slat (usec): min=361, max=8670, avg=1211.80, stdev=255.08
clat (nsec): min=1248, max=96486k, avg=76504648.29, stdev=4610900.23
lat (usec): min=842, max=97685, avg=77716.45, stdev=4658.91
clat percentiles (usec):
| 1.00th=[65274], 5.00th=[69731], 10.00th=[70779], 20.00th=[72877],
| 30.00th=[74974], 40.00th=[76022], 50.00th=[77071], 60.00th=[78119],
| 70.00th=[79168], 80.00th=[80217], 90.00th=[82314], 95.00th=[83362],
| 99.00th=[86508], 99.50th=[86508], 99.90th=[88605], 99.95th=[89654],
| 99.99th=[92799]
bw ( KiB/s): min=20536, max=29344, per=100.00%, avg=26330.20, stdev=132.10, samples=960
iops : min= 5134, max= 7336, avg=6582.55, stdev=33.03, samples=960
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06%
lat (msec) : 100=99.91%
cpu : usr=0.34%, sys=1.42%, ctx=395100, majf=0, minf=1107
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,394953,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

This same benchmark in a VM is worse:

write: IOPS=570, BW=2281KiB/s (2336kB/s)(66.8MiB/30003msec); 0 zone resets

Benchmarking the actual nvme device gets me:

write: IOPS=361k, BW=1409MiB/s (1477MB/s)(82.5GiB/60001msec); 0 zone resets

I would think I could get more than 570 IOPS on this hardware in a VM.

I can boil the question down to:

Can I expect to get more than 10K iops in a VM using ceph on this hardware? If not, what are my other options?
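
A quick sanity check on the rbd numbers above, offered as a sketch rather than a diagnosis: with numjobs=8 and iodepth=64 there are up to 512 writes in flight, and Little's law says throughput is roughly in-flight I/Os divided by mean latency. The helper name `expected_iops` is made up for illustration; the inputs are taken from the fio output in this post.

```shell
#!/usr/bin/env bash
# Little's law: IOPS ~= (I/Os in flight) / (mean latency in seconds).
# expected_iops is a hypothetical helper, not part of fio.
expected_iops() {
  # $1 = numjobs, $2 = iodepth, $3 = mean total latency in microseconds
  awk -v jobs="$1" -v depth="$2" -v lat_us="$3" \
    'BEGIN { printf "%.0f\n", (jobs * depth) / (lat_us / 1000000) }'
}

# From the rbd run: numjobs=8, iodepth=64, "lat (usec): ... avg=77716.45"
expected_iops 8 64 77716.45   # prints 6588
```

That is within a fraction of a percent of the 6582 IOPS fio reported. In other words, at this queue depth the IOPS figure is just a restatement of the roughly 78 ms average latency, so latency is the number to chase.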
 
try using the built in rados bench for ceph benchmarking

rados bench -p your_pool_name 60 write --no-cleanup -b 4K -t 16

for your info, my setup is mesh network 40g, mtu 9000 (connectx3 kind of suck because of higher latency, connectx4+ should be better at 4k iops)

here are my results on the homelab

Total time run: 60.0005
Total writes made: 1263862
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 82.282
Stddev Bandwidth: 6.16519
Max bandwidth (MB/sec): 89.2305
Min bandwidth (MB/sec): 63.082
Average IOPS: 21064
Stddev IOPS: 1578.29
Max IOPS: 22843
Min IOPS: 16149
Average Latency(s): 0.000758534
Stddev Latency(s): 0.000529636
Max latency(s): 0.0702235
Min latency(s): 0.000316572
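
Worth noting for anyone copying the command above: `--no-cleanup` leaves the benchmark objects in the pool. Here is a sketch of a full write/read/cleanup cycle that only prints the commands by default, so nothing runs by accident. `POOL`, `DRY_RUN`, `run` and `bench_cycle` are names chosen here for illustration, not anything from Ceph, and the 60-second random-read pass is an assumption about what you would want to measure next.

```shell
#!/usr/bin/env bash
# Dry-run by default: prints the commands instead of executing them.
# Set DRY_RUN=0 on a node with the Ceph CLI and a working keyring to run them.
POOL="${POOL:-your_pool_name}"   # placeholder pool name, as in the post above
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

bench_cycle() {
  # 4K writes, 16 concurrent ops; keep the objects for the read pass
  run rados bench -p "$POOL" 60 write --no-cleanup -b 4K -t 16
  # random reads against the objects written above
  run rados bench -p "$POOL" 60 rand -t 16
  # remove the benchmark objects that --no-cleanup left behind
  run rados -p "$POOL" cleanup
}

bench_cycle
```

Running the benchmark against a scratch pool rather than the one holding VM disks keeps the test objects away from production data.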
 
Total time run: 60.0012
Total writes made: 910491
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 59.2755
Stddev Bandwidth: 1.37628
Max bandwidth (MB/sec): 62.0977
Min bandwidth (MB/sec): 55.8711
Average IOPS: 15174
Stddev IOPS: 352.328
Max IOPS: 15897
Min IOPS: 14303
Average Latency(s): 0.00105369
Stddev Latency(s): 0.000369505
Max latency(s): 0.030337
Min latency(s): 0.000355744

Settings for nic2:
Supported ports: [ Backplane ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None RS BASER
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None
Speed: 25000Mb/s
Lanes: 1
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: g
Wake-on: g
Link detected: yes

4: nic2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 88:e9:a4:3f:be:c4 brd ff:ff:ff:ff:ff:ff
altname enp59s0f0np0
altname enx88e9a43fbec4


My 6526Y CPUs are supposed to be 4% faster single core and 65% faster multicore than your 9700K. Do you think the issue is my switch? Do you think 40G networking would make a difference here? I'm using DAC cables, perhaps fiber matters?

Load average is only 1.4 during this test, and the switch shows around 8k packets per second and 220 Mbps, so I don't think a bigger switch will make it much faster, but 40G does decrease Ethernet latency.

Thanks again for the help, this does confirm ceph isn't working like it should on my hardware.
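
One way to read the two rados bench runs, as a rough sketch: assume a single writer that waits for each write to complete before sending the next (queue depth 1, which is roughly how a sync-heavy database or mail store can behave). Its ceiling is then about 1 / average latency. `qd1_ceiling` is a hypothetical helper name; the latencies are the "Average Latency(s)" values from the two result blocks.

```shell
#!/usr/bin/env bash
# Upper bound for a strictly serial writer: one write per round trip.
# qd1_ceiling is a hypothetical helper, not part of rados or fio.
qd1_ceiling() {
  # $1 = average latency in seconds
  awk -v lat="$1" 'BEGIN { printf "%.0f\n", 1 / lat }'
}

qd1_ceiling 0.00105369    # 25G switched cluster: prints 949
qd1_ceiling 0.000758534   # 40G mesh homelab:     prints 1318
```

So even with the pool behaving as measured, a strictly serial sync writer would top out around 1k IOPS at about 1 ms per write; the 15K figure depends on parallelism (here -t 16). Whether the 570 IOPS VM result falls into this category depends on the fio options used inside the VM, which aren't shown.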
 
I think this looks pretty normal if you are using a switch. For me, inside a VM I can get pretty much the same IOPS as the Ceph pool itself, so it would be worth checking the VM disk config to see why you are getting 500 rather than 15,000.

for recommended vm config for disk:

scsi controller: virtio scsi single

hard disk: scsi
tick the SSD emulation, IO thread, and Discard checkboxes, and set cache=writeback
make sure virtio drivers are installed
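
The same settings can also be applied from the Proxmox VE command line with `qm set`. This is a sketch that prints the commands by default; VM ID 100 and the volume `ceph-vm:vm-100-disk-0` are made-up examples, and `run` / `apply_disk_config` are helper names chosen here. Check `qm config <vmid>` for the real volume name before running anything.

```shell
#!/usr/bin/env bash
# Dry-run by default: prints the qm commands instead of executing them.
# Set DRY_RUN=0 on the Proxmox VE host to apply them.
VMID="${VMID:-100}"                         # hypothetical VM ID
VOLUME="${VOLUME:-ceph-vm:vm-100-disk-0}"   # hypothetical storage:volume
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

apply_disk_config() {
  # SCSI controller: VirtIO SCSI single (one controller per disk, needed for IO thread)
  run qm set "$VMID" --scsihw virtio-scsi-single
  # Disk on the SCSI bus with writeback cache, discard, IO thread and SSD emulation
  run qm set "$VMID" --scsi0 "${VOLUME},cache=writeback,discard=on,iothread=1,ssd=1"
}

apply_disk_config
```

Controller and disk option changes generally need a full stop and start of the VM (not just a reboot from inside the guest) before they take effect.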