NFS VM over Ceph

czechsys

Renowned Member
Nov 18, 2015
419
43
93
Hi,

I am testing the performance of PVE 6 with Ceph on 2 nodes (2/2 replication) and 2 SSD OSDs per node; the network is a single shared 10 Gbps link for all traffic on the PVE hosts (just for a small test).

HW: 2x HP DL380p G8; each node has 2x E5-2690 2.9 GHz, 96 GB RAM, a SmartArray P430, 1x Intel S4610 2TB, 1x Kingston DC500M, 10 Gbps optical. PVE is installed on dedicated SSDs. The Ceph OSDs are single-disk RAID0 volumes on the P430, 1 SSD per BlueStore OSD. So 4 OSDs total on 2 nodes with a 2/2 replica.
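
For reference, a 2/2 replicated pool like the one described could be set up with plain Ceph commands; the pool name and PG count below are placeholders, not taken from the post:
Code:
# hypothetical pool matching the described 2/2 setup
ceph osd pool create vm-pool 128 128 replicated
ceph osd pool set vm-pool size 2
ceph osd pool set vm-pool min_size 2
ceph osd pool application enable vm-pool rbd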

Now, some Ceph fio tests from a VM (Debian 10 defaults), all VMs on the same host:
1x VM: read 28k, write 10k, readwrite 14k/5k iops <-- that's acceptable
1x VM (NFS server) + 1x VM (NFS client): read 11.1k, write 0.8k, readwrite 1.8k/0.6k iops (from client) <-- this looks very bad

For comparison, DRBD (VMs as raw files):
1x VM: read 45k, write 21k, readwrite 28k/10k iops <-- that's semi-expected
1x VM (NFS server) + 1x VM (NFS client): read 31k, write 4.8k, readwrite 11.6k/3.8k iops (from client) <-- that's acceptable

I need a shared FS (at least Samba; NFS is the better option), but these tests show very bad write IOPS from the NFS client to the NFS server on Ceph. Any idea how to make it better? I have a feeling that even Samba will be very slow on Ceph...

Thanks.
 
HW: 2x HP DL380p G8; each node has 2x E5-2690 2.9 GHz, 96 GB RAM, a SmartArray P430, 1x Intel S4610 2TB, 1x Kingston DC500M, 10 Gbps optical. PVE is installed on dedicated SSDs. The Ceph OSDs are single-disk RAID0 volumes on the P430, 1 SSD per BlueStore OSD. So 4 OSDs total on 2 nodes with a 2/2 replica.
The hardware is sub-optimal, see the precondition (link). And (any) RAID is a no-go for Ceph (or ZFS).
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
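
If the controller firmware supports it, the P430 can reportedly be switched to an HBA/passthrough mode instead of single-disk RAID0. A sketch with HPE's ssacli tool might look like this (the slot number is an assumption; verify the commands and firmware support against HPE's documentation):
Code:
# show the controller slot and current configuration first
ssacli ctrl all show config
# switch the controller at slot 0 (assumed) to HBA mode
ssacli ctrl slot=0 modify hbamode=on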

Now, some Ceph fio tests from a VM (Debian 10 defaults), all VMs on the same host:
1x VM: read 28k, write 10k, readwrite 14k/5k iops <-- that's acceptable
1x VM (NFS server) + 1x VM (NFS client): read 11.1k, write 0.8k, readwrite 1.8k/0.6k iops (from client) <-- this looks very bad

For comparison, DRBD (VMs as raw files):
1x VM: read 45k, write 21k, readwrite 28k/10k iops <-- that's semi-expected
1x VM (NFS server) + 1x VM (NFS client): read 31k, write 4.8k, readwrite 11.6k/3.8k iops (from client) <-- that's acceptable
Ceph and DRBD work differently; it is like comparing apples with pears. E.g., Ceph has no data locality, contrary to DRBD.

I need a shared FS (at least Samba; NFS is the better option), but these tests show very bad write IOPS from the NFS client to the NFS server on Ceph. Any idea how to make it better? I have a feeling that even Samba will be very slow on Ceph...
As said above, the hardware is sub-optimal, and Ceph will not perform the same way as DRBD. While you can add caches and tune NFS/SMB, I'd recommend testing ZFS with storage replication (it might work better) or using DRBD (Linbit released a new plugin).
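
As one example of such NFS tuning, a sketch only: the 'async' export option ACKs writes before they reach stable storage, which usually lifts small-block write IOPS but risks data loss on a server crash, so use it with care.
Code:
# /etc/exports on the NFS server VM
/srv/nfs network/24(rw,fsid=0,async,no_subtree_check,root_squash,crossmnt,sec=sys)
# on the client, larger transfer sizes can help throughput
mount -t nfs4 -o rsize=1048576,wsize=1048576,noatime first_server_ip:/srv/nfs /mnt/nfs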
 
Thank you for your reply. If we exclude the RAID card (no HBA available for now), which parts of the hardware are sub-optimal? I checked those links: enough RAM, fast SSDs plus an extra SSD for the OS, CPU near top frequency with enough cores, multiple 10G links in reserve... I know the number of hosts/OSDs is low, but if we exclude the number of OSDs, is the node hardware really sub-optimal?

I know the differences between DRBD/Ceph/ZFS; we are running DRBD clusters. ZFS is a no-go for us due to the missing realtime replication. We will test the Linbit plugin for Proxmox, but we need to know whether NFS will be fast enough on Ceph before we buy more nodes, because on 2/2 nodes the result is very low.
Tests with NFS-Ganesha are on the task list too.
 
If we exclude the RAID card (no HBA available for now), which parts of the hardware are sub-optimal? I checked those links: enough RAM, fast SSDs plus an extra SSD for the OS, CPU near top frequency with enough cores, multiple 10G links in reserve... I know the number of hosts/OSDs is low, but if we exclude the number of OSDs, is the node hardware really sub-optimal?
You cannot just exclude parts of the system and ask if it would then be OK/better. All components matter together.

But you already stated the two apparent points I was referring to. Further, I would add that the latency might not be on par with the current CPU/memory generation (the G8 was retired in 2015, AFAIR). Ceph is sensitive to latency, and anything that lowers it is good, e.g. using all available memory channels.

Another point to watch out for is the latency of the two different SSD models. As there are only two OSDs per node, they might introduce latency spikes (one being slower than the other).
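
A common way to compare the two SSD models is a single-threaded O_SYNC 4k write test with fio, run against each disk separately; the slower disk effectively caps the pool. Note this is destructive when pointed at a raw device, so only run it on an empty/test disk:
Code:
fio --name=latency-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting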

Depending on the workload, most of these points might be negligible (not the RAID, though). That's why I called the setup sub-optimal. But I understand that sometimes you have to use what you have at hand. :cool:

I know the differences between DRBD/Ceph/ZFS; we are running DRBD clusters. ZFS is a no-go for us due to the missing realtime replication,
Yes, ZFS is not meant for synchronous replication. For anybody reading this: yes, Ceph and DRBD (Protocol C) do synchronous writes.
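
For completeness, PVE's (asynchronous) ZFS storage replication is configured per guest with pvesr, e.g. as below; the VM ID, job number, target node, and schedule are placeholders:
Code:
# replicate guest 100 to node "targetnode" every 15 minutes
pvesr create-local-job 100-0 targetnode --schedule "*/15"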

we will test the Linbit plugin for Proxmox, but we need to know whether NFS will be fast enough on Ceph before we buy more nodes, because on 2/2 nodes the result is very low.
Tests with NFS-Ganesha are on the task list too.
Ceph will be slower, as it doesn't take any locality into account, and a write is only ACKed once it has been written to the secondary OSD too. With only two OSDs per node, every write has to hit an OSD on each node. For data safety alone you need to buy a third node to have at least 3 copies: if one server fails, only one copy is left, and any OSD failure on the remaining node loses data. More OSDs on more nodes will speed up Ceph, as each single OSD gets fewer hits.
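
Once a third node is added, the pool could then be switched to the recommended 3/2 setup (the pool name here is a placeholder):
Code:
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2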
 
Can anybody with a small PVE 6 Ceph cluster with SSDs please run a small test? Ideally a 3/2 setup with 3 or 4 hosts.
Please post your specs: server type, CPUs, RAM, OSDs, 10 Gbps network.

For example, with 2x Debian 10 VMs on the 10 Gb network:

1] fio from 1x VM
Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
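
Optionally, a cluster-level baseline outside any VM could be taken with rados bench for comparison; this assumes a dedicated, disposable test pool named "bench":
Code:
rados bench -p bench 60 write -b 4096 -t 16 --no-cleanup
rados bench -p bench 60 rand -t 16
rados -p bench cleanup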

2]
First VM
a] install nfs-kernel-server
b] /etc/exports
Code:
/srv/nfs network/24(rw,fsid=0,sync,subtree_check,root_squash,crossmnt,sec=sys)
c] create /srv/nfs/test
d]
Code:
exportfs -av
Second VM
e] install nfs-common
f]
Code:
mkdir /mnt/nfs
g]
Code:
mount -t nfs4 first_server_ip:/srv/nfs /mnt/nfs
h]
Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
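
To make the numbers easy to compare across posters, the same runs could also emit machine-readable output; the jq filter below is just an illustration, assuming jq is installed:
Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --output-format=json --output=randwrite.json
jq '.jobs[0].write.iops' randwrite.json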

Thanks.
 
