NFS VM over Ceph

czechsys · Aug 23, 2019

Hi,

I am testing the performance of PVE 6 with Ceph on 2 nodes (2/2) with 2 SSD OSDs per node. The network is a single shared 10Gbps link for all traffic on the PVE hosts (just for a small test).

HW: 2x HP DL380p G8, every node has 2x E5-2690 2.9GHz, 96GB RAM, SmartArray P430, 1x Intel S4610 2TB, 1x Kingston DC500m, 10Gbps optical. PVE is installed on dedicated SSDs. The Ceph OSDs are RAID0 volumes on the P430, 1x SSD per BlueStore OSD. So 4x OSDs in total on 2 nodes, 2/2 replica.

Now, some Ceph fio tests from a VM (Debian 10 defaults), all VMs on the same host:
1x VM: read 28k, write 10k, readwrite 14k/5k iops <-- that's acceptable
1x VM (nfs server) + 1x VM (nfs client): read 11.1k, write 0.8k, readwrite 1.8k/0.6k iops (from client) <-- this looks very bad

For comparison, DRBD (VMs as raw files):
1x VM: read 45k, write 21k, readwrite 28/10 <-- that's semi-expected
1x VM (nfs server) + 1x VM (nfs client): read 31k, write 4.8k, readwrite 11.6/3.8k (from client) <-- that's acceptable

I need a shared filesystem (at least Samba; NFS is the better option), but these tests show very bad write IOPS from the NFS client to the NFS server on Ceph. Any idea how to improve this? I have a feeling that even Samba will be very slow on Ceph...

Thanks.

Alwin · Aug 26, 2019

czechsys said:
HW: 2x HP DL380p G8, every node has 2x E5-2690 2.9GHz, 96GB RAM, SmartArray P430, 1x Intel S4610 2TB, 1x Kingston DC500m, 10Gbps optical. PVE is installed on dedicated SSDs. The Ceph OSDs are RAID0 volumes on the P430, 1x SSD per BlueStore OSD. So 4x OSDs in total on 2 nodes, 2/2 replica.

The hardware is sub-optimal, see the precondition (link). And (any) RAID is a no-go for Ceph (or ZFS).
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

czechsys said:
Now, some Ceph fio tests from a VM (Debian 10 defaults), all VMs on the same host:
1x VM: read 28k, write 10k, readwrite 14k/5k iops <-- that's acceptable
1x VM (nfs server) + 1x VM (nfs client): read 11.1k, write 0.8k, readwrite 1.8k/0.6k iops (from client) <-- this looks very bad

For comparison, DRBD (VMs as raw files):
1x VM: read 45k, write 21k, readwrite 28/10 <-- that's semi-expected
1x VM (nfs server) + 1x VM (nfs client): read 31k, write 4.8k, readwrite 11.6/3.8k (from client) <-- that's acceptable

Ceph and DRBD work differently; it is like comparing apples with pears. E.g., Ceph has no data locality, in contrast to DRBD.

czechsys said:
I need a shared filesystem (at least Samba; NFS is the better option), but these tests show very bad write IOPS from the NFS client to the NFS server on Ceph. Any idea how to improve this? I have a feeling that even Samba will be very slow on Ceph...

As said above, the hardware is sub-optimal and it will not perform the same way with Ceph as with DRBD. While you can add caches and tune NFS/SMB, I'd recommend testing ZFS with storage replication (might work better) or using DRBD (LINBIT released a new plugin).

czechsys · Aug 26, 2019

Thank you for your reply. If we exclude the RAID card (no HBA available for now), which parts of the hardware are sub-optimal? I checked those links: enough RAM, fast SSDs plus an extra SSD for the OS, CPU near top frequency with enough cores, multiple 10G links in reserve... I know the number of hosts/OSDs is low, but if we exclude the number of OSDs, is the HW of the node really sub-optimal?

I know the difference between DRBD/Ceph/ZFS; we are running DRBD clusters. ZFS is a no-go for us due to the missing real-time replication. We will test the LINBIT plugin for Proxmox, but we need to know whether NFS will be fast enough on Ceph before we buy more nodes, because on 2/2 nodes the result is very low.
Tests with ganesha-nfs are on the task list too.

Alwin · Aug 26, 2019

czechsys said:
If we exclude the RAID card (no HBA available for now), which parts of the hardware are sub-optimal? I checked those links: enough RAM, fast SSDs plus an extra SSD for the OS, CPU near top frequency with enough cores, multiple 10G links in reserve... I know the number of hosts/OSDs is low, but if we exclude the number of OSDs, is the HW of the node really sub-optimal?

You cannot just exclude parts of the system and ask if it would then be OK/better. It needs all components.

But you already stated the two apparent points that I was referring to. Further, I would add that the latency might not be on par with the current CPU/memory generation (G8 was retired in 2015, AFAIR). Ceph is sensitive to latency, and anything that lowers it is good, e.g. using all available memory channels.

Another point to watch out for is the latency of the two different SSD models. As there are only two OSDs on each of the two nodes, they might introduce latency spikes (one being slower than the other).

Depending on the workload, most of these points might be negligible (not the RAID though). That's why I called it sub-optimal. But I understand that sometimes you have to use what you have at hand.
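
To put the latency point into numbers, here is a rough sketch (not from the thread) that converts the IOPS reported in the first post into an average per-request latency using Little's law. It assumes fio really keeps all 64 requests of --iodepth=64 in flight, which may not hold over NFS, so treat the results as an upper bound. The helper name avg_latency_ms is made up for this example.

```shell
#!/bin/sh
# Rough estimate only: average latency = outstanding requests / IOPS (Little's law).
# Assumes the queue stays full for the whole run (iodepth=64 as in the fio tests).
# avg_latency_ms is a hypothetical helper name, not part of fio or Ceph.
avg_latency_ms() {
    # $1 = queue depth, $2 = measured IOPS
    awk -v depth="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", depth / iops * 1000 }'
}

echo "single VM on Ceph, randwrite (10k IOPS):   $(avg_latency_ms 64 10000) ms"
echo "NFS client on Ceph, randwrite (0.8k IOPS): $(avg_latency_ms 64 800) ms"
echo "NFS client on DRBD, randwrite (4.8k IOPS): $(avg_latency_ms 64 4800) ms"
```

Under these assumptions, the NFS-on-Ceph write result corresponds to roughly 80 ms per 4k write, versus about 6.4 ms for the single VM writing to Ceph directly. That gap is what any tuning would have to close.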

czechsys said:
I know the difference between DRBD/Ceph/ZFS; we are running DRBD clusters. ZFS is a no-go for us due to the missing real-time replication,

Yes, ZFS is not meant for synchronous replication. For anybody reading this: yes, Ceph and DRBD (Protocol C) do synchronous writes.

czechsys said:
We will test the LINBIT plugin for Proxmox, but we need to know whether NFS will be fast enough on Ceph before we buy more nodes, because on 2/2 nodes the result is very low.
Tests with ganesha-nfs are on the task list too.

Ceph will be slower, as it doesn't take any locality into account and a write is only acknowledged once it has been written to the secondary OSD too. With only two OSDs per node and a 2/2 replica, every write has to hit an OSD on each node. For data safety alone you need to use (buy) a third node to have at least 3 copies. If one server fails, only one copy is left, and any OSD failure on the remaining node will lose data. More OSDs on more nodes will speed up Ceph, as a single OSD gets fewer hits.
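
For readers who want to check or change the replica settings discussed above, a minimal sketch follows. It is not from the thread: the pool name vm-pool is a placeholder, the run helper and the DRY_RUN switch are invented for this example, and it assumes a third node with OSDs has already been added. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: move a replicated pool from 2/2 to 3/2 (size/min_size).
# POOL is a placeholder name; DRY_RUN and run() are helpers invented for this example.
POOL="${POOL:-vm-pool}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the current replica settings.
run ceph osd pool get "$POOL" size
run ceph osd pool get "$POOL" min_size

# Keep three copies; allow I/O as long as two of them are available.
run ceph osd pool set "$POOL" size 3
run ceph osd pool set "$POOL" min_size 2

# Watch the resulting backfill/recovery.
run ceph -s
```

Set DRY_RUN=0 to actually execute the commands on a node with admin access to the cluster. Raising size triggers data movement, so expect extra load on the OSDs and the network while the third copies are created.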

czechsys · Aug 26, 2019

Could anybody with a small PVE 6 Ceph cluster with SSDs please run a small test? Ideally a 3/2 setup with 3 or 4 hosts.
Please post your specs: server type, CPUs, RAM, OSDs, 10Gbps network.

For example 2x 10Gb Debian 10 VMs:

1] fio from 1x VM

Code:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

2]
First VM
a] install nfs-kernel-server
b] /etc/exports

Code:

/srv/nfs network/24(rw,fsid=0,sync,subtree_check,root_squash,crossmnt,sec=sys)

c] create /srv/nfs/test
d]

Code:

exportfs -av

Second VM
e] install nfs-common
f]

Code:

mkdir /mnt/nfs

g]

Code:

mount -t nfs4 first_server_ip:/srv/nfs /mnt/nfs

h] run fio on the client, inside the mounted share

Code:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
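
To make sure the local and the NFS runs use identical parameters, the three commands can be wrapped as sketched below. This is an illustration, not part of the original test plan: TARGET_DIR, fio_args and the DRY_RUN switch are names chosen for this example, and the default path assumes the share is mounted at /mnt/nfs with the test directory from step c]. By default the script only prints the fio command lines.

```shell
#!/bin/sh
# Sketch: run the same three fio workloads against any directory, so the local
# disk and the NFS mount are tested with identical parameters.
# TARGET_DIR, fio_args and the DRY_RUN switch are assumptions made for this example.
TARGET_DIR="${TARGET_DIR:-/mnt/nfs/test}"
DRY_RUN="${DRY_RUN:-1}"

fio_args() {
    # $1 = workload: randwrite, randread or randrw
    args="--randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test"
    args="$args --filename=$TARGET_DIR/test --bs=4k --iodepth=64 --size=4G --readwrite=$1"
    # The mixed workload uses 75% reads, as in the original commands.
    if [ "$1" = "randrw" ]; then
        args="$args --rwmixread=75"
    fi
    echo "$args"
}

for mode in randwrite randread randrw; do
    if [ "$DRY_RUN" = "1" ]; then
        echo "fio $(fio_args "$mode")"
    else
        # Word splitting of the argument string is intended here,
        # so TARGET_DIR must not contain spaces.
        fio $(fio_args "$mode")
    fi
done
```

Run it once with TARGET_DIR pointing at a local directory inside a VM and once with the NFS mount on the client, each time with DRY_RUN=0, and compare the reported IOPS.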

Thanks.
