Proxmox + ZFS + Gluster

jinjer

I'd like to share some experience with you regarding the use of proxmox with zfsonlinux and glusterfs for hosting KVM and OpenVZ images.

The good thing is that it works, but using zfsonlinux is a somewhat bittersweet experience.

I love zfs and the way it works on solaris kernels, but the port to linux is a little on the rough side. Gluster adds to the mix by taxing the fs with long (unnecessarily so) xattrs that cause more seeks than necessary on ZFS and on every other filesystem except XFS, where you can set a bigger xattr block.

However, my testing indicates that XFS is generally slower than ZFS (tuned properly) for most tasks related to hosting KVM images, except in one area where ZFS+Gluster is a pain: File creation.

File creation is so slow that it's barely usable: my bonnie tests show sequential create at 50 files/sec, while random create drops to about half of that. It beats me, as the ZFS underneath can easily handle 15K+ file creations/sec.

Jinjer.
 
Yes, I use xattr=sa; setting it to on instead costs roughly a factor of two (about a 1:2 penalty).

I made a small program to list xattr from files and mine are barely under the 128 byte limit for zfsonlinux, so I think I'm saving one seek per file creation.
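The program mentioned above was not posted. As a rough stand-in, a minimal sketch like the following sums the xattr name and value bytes reported by getfattr. The function name xattr_bytes is hypothetical, and name bytes plus value bytes is only an approximation of what ZFS actually stores, so treat the result as an estimate.

```shell
#!/bin/bash
# Hypothetical helper: sum xattr name + value bytes from the output of
# `getfattr -d -m - -e hex FILE` read on stdin.
# Assumption: name length + decoded value length approximates the stored size.
xattr_bytes() {
    awk -F= '
        /^# file:/ || NF < 2 { next }               # skip the header and blank lines
        {
            name = $1
            value = substr($0, length(name) + 2)    # everything after the first "="
            sub(/^0x/, "", value)                   # drop the hex prefix
            total += length(name) + length(value) / 2   # two hex digits per byte
        }
        END { print total + 0 }
    '
}
```

Typical use, run as root on a brick so the trusted.* attributes are visible: getfattr -d -m - -e hex /path/to/file 2>/dev/null | xattr_bytes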

I did a multi-threaded bonnie run on four nodes with 1, 4, 8, and 10 threads per node. It scales linearly up to 4 threads per node and then starts to saturate. 10 threads per node kills the nodes almost completely. The total combined file creation rate for the cluster is about 1500 files/sec, but no more than 50-60 per thread.

I am planning to use part of this storage for a dovecot mail spool and file creation rate is important, as is the ability to recover snapshots from the underlying zfs filesystem.

Perhaps I cannot ask for more (dual bonded 1G nics and 4 x sata drives per node).

What does your hardware look like?

jinjer
 
Everything I have is mostly consumer grade hardware; this is just a cluster for my personal use at home. One node is a Core i7 920, 16GB of RAM, with three 3TB WD Reds. The other node is a Phenom 1090T, 16GB of RAM, with three 3TB WD Reds. They are interconnected with 10GbE NICs. Each node contains a pool that is a stripe of its three Reds, and each pool has two zfs filesystems which are the bricks for two different replicated gluster volumes.

I did a quick file creation test against the volume that hosts my proxmox vm images and got about 290 file creates per second from a single thread:
root@proxmox01:/mnt/pve/vmstore01/test# time touch test{1..1000}



real 0m3.457s
user 0m0.009s
sys 0m0.057s

My other gluster volume does not perform as well:
root@proxmox01:/mnt/pve/samba01/test# time touch test{1..1000}


real 0m11.155s
user 0m0.011s
sys 0m0.072s
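To turn the "real" times above into creates per second, a small sketch like this does the arithmetic. The function name creates_per_sec is made up for illustration; it only assumes the NmS.SSSs format that bash's time prints.

```shell
#!/bin/bash
# Hypothetical helper: convert a file count and a bash `time` "real" value
# (for example 0m3.457s) into file creates per second.
creates_per_sec() {
    local count=$1 real=$2
    LC_ALL=C awk -v n="$count" -v t="$real" 'BEGIN {
        split(t, parts, "m")            # parts[1] = minutes, parts[2] = seconds plus "s"
        sub(/s$/, "", parts[2])         # strip the trailing "s"
        secs = parts[1] * 60 + parts[2]
        printf "%.1f\n", n / secs
    }'
}

creates_per_sec 1000 0m3.457s    # vmstore01 run
creates_per_sec 1000 0m11.155s   # samba01 run
```

With the two runs above this works out to roughly 289 creates/sec for vmstore01 and roughly 90 creates/sec for samba01.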


Frankly, I'm still not quite sure why there is a difference. If I configure my slower samba volume with the exact same gluster settings as the vmstore, it still performs slower (even though both are hitting the exact same zpool).


If it helps, here are my non-default settings for the zfs pools:

NAME              PROPERTY        VALUE  SOURCE
pool01            compression     on     local
pool01            atime           off    local
pool01            refreservation  1T     local
pool01/samba01    compression     on     inherited from pool01
pool01/samba01    atime           off    inherited from pool01
pool01/samba01    xattr           sa     local
pool01/vmstore01  compression     on     inherited from pool01
pool01/vmstore01  atime           off    inherited from pool01
pool01/vmstore01  xattr           sa     local
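Since xattr=sa is set per dataset here and not on the pool itself, a quick way to spot a dataset that was missed is to filter the scripted output of zfs get. This is only a sketch; the function name datasets_without_sa is hypothetical, and it expects the tab-separated name and value columns that zfs get -H -o name,value produces.

```shell
#!/bin/bash
# Hypothetical helper: read tab-separated "name<TAB>value" lines on stdin
# (as printed by `zfs get -H -r -o name,value xattr POOL`) and print every
# dataset whose xattr property is not "sa".
datasets_without_sa() {
    awk -F'\t' '$2 != "sa" { print $1 }'
}
```

Typical use: zfs get -H -r -o name,value xattr pool01 | datasets_without_sa. With the settings listed above, that would be expected to print only pool01, which is not used as a brick.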

And likewise, here are my two gluster volumes:

Volume Name: samba01
Type: Replicate
Volume ID: fa21d785-bb52-4063-b435-429c5efead40
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: proxmox01-san:/pool01/samba01
Brick2: proxmox02-san:/pool02/samba01
Options Reconfigured:
performance.io-thread-count: 64
performance.force-readdirp: on
performance.stat-prefetch: on
performance.write-behind: on
performance.flush-behind: off
performance.read-ahead: off
nfs.disable: off


Volume Name: vmstore01
Type: Replicate
Volume ID: 94194ae6-752a-4c24-baff-cc005d4cabc4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: proxmox01-san:/pool01/vmstore01
Brick2: proxmox02-san:/pool02/vmstore01
Options Reconfigured:
network.remote-dio: on
nfs.disable: on
performance.read-ahead: off
performance.flush-behind: off
performance.write-behind: off
performance.force-readdirp: off
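To compare the reconfigured options of the two volumes line by line, one option is to pull just that section out of gluster volume info and diff the results. A minimal sketch, with volume_options as a made-up helper name:

```shell
#!/bin/bash
# Hypothetical helper: read `gluster volume info VOLNAME` output on stdin and
# print only the lines under "Options Reconfigured:", sorted so two volumes
# can be compared with diff.
volume_options() {
    awk '
        /^Options Reconfigured:/ { in_opts = 1; next }   # section starts here
        in_opts && NF == 0       { in_opts = 0 }         # a blank line ends it
        in_opts                  { print }
    ' | sort
}
```

Typical use: diff <(gluster volume info samba01 | volume_options) <(gluster volume info vmstore01 | volume_options)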
 
Thank you for sharing your details. I think that having a 10G network is a good start, but something must be very wrong in my setup:

time touch test{1..1000}


real 0m30.611s
user 0m0.005s
sys 0m0.094s



I'll try tweaking some of the gluster pool parameters.

The script I'm using for testing is based on bonnie++
Code:
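# cat bench.sh
#!/bin/bash

# Create the semaphore for the synchronized runs; the N in -pN should match the number of -ys runs below.
bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -p4&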


bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out1&
bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out2&
bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out3&
bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out4&
#bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out5&
#bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out6&
#bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out7&
#bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out8&
#bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out9&
#bonnie++ -s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0 -ys >> $(hostname).out10&

You can choose the number of threads (-pN) and then uncomment the matching number of bonnie++ lines below it.
Log in on both nodes, put the script on the pool, and create a temporary directory 't' owned by nobody.

Then start the script at the same time on both nodes and wait for completion.
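Instead of commenting and uncommenting lines by hand, the same set of commands could be generated for any thread count. This is a sketch that reuses the exact bonnie++ arguments from the script above; the function name bonnie_commands and the final wait line are additions for illustration, not part of the original script.

```shell
#!/bin/bash
# Hypothetical generator: print the bonnie++ command lines for N synchronized
# workers. It only prints them (a dry run), so they can be reviewed first.
bonnie_commands() {
    local n=$1
    local args="-s 0 -u nobody -d t -f -n 16:65536:1024:8:8192 -z 0"
    local i
    # One run with -pN sets up the semaphore for N workers.
    echo "bonnie++ $args -p$n &"
    # N workers with -ys, each appending to its own output file.
    for ((i = 1; i <= n; i++)); do
        echo "bonnie++ $args -ys >> \$(hostname).out$i &"
    done
    # Added for illustration: block until all background runs finish.
    echo "wait"
}
```

Typical use: bonnie_commands 8 > bench8.sh, review the file, then run it with bash on each node from the directory that contains 't'.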