Proxmox 4.0 ZFS disk configuration with L2ARC + SLOG - 2nd opinion

sapphiron

Well-Known Member
Nov 2, 2012
Hi All

BACKGROUND
We currently run PVE 3.4 on a Xeon E3 with 32GB of RAM and 4x 3TB WD RE4 SATA drives in MDADM RAID 10 with LVM.
We have a small multi-tenanted operation, and we all share this server for a variety of software: a little bit of everything, from a Windows file and WSUS server, a JBoss/MySQL web application and a UniFi Wi-Fi controller to a Pervasive-based accounting system. We have 12 VMs in total. The OSes are Windows Server 2008 R2 and 2012 R2, and the rest are Ubuntu 14.04.

WHAT IS WORKING WELL
1. Stability: it has been running nearly without interruption for 2 years and reached 250 days of uptime at one point.
2. Non-disk performance: for applications that do not need aggressive disk access, performance is excellent.
3. Backups are pushed to an NFS server on a FreeNAS box on the same LAN, but in a different building. Both Proxmox and traditional backups.

WHAT IS NOT WORKING WELL
1. Disk performance. It started out well, and the raw disk performance numbers are still good, but the nature of the workload is causing performance issues with random I/O. We have had to resize disks quite a bit over the last 2 years, so I suspect fragmentation is the reason. IO-wait is around 4% on average.
2. Load average is around 4.0, with a peak-time average of about 6 to 8.
3. Most of my RAM is allocated to try to limit the disk requirements. Combined VM allocation is about 29GB, with 25GB currently used.

FUTURE PLANS
1. I plan to re-install with Proxmox 4.0. A clean install.
2. I will be backing up all PVE 3.4 VMs to the NFS server.
3. Proxmox will be installed on a dedicated 64GB SSD with the default EXT4 partitioning.
4. We are adding 2 more 3TB WD RE4 drives for a total of 6, to add more storage and provide a bit more disk performance.
5. I am considering adding 2 SSDs into the mix to act as ZFS L2ARC and ZIL/SLOG; more on this configuration below. The SSDs we plan to use are Samsung SM863 480GB (MZ-7KM480Z) models. They have a write endurance of 3,080 TB with built-in ECC and power-loss protection, along with a $300 price tag in my region. Finally an SSD I can expect to last 5 years under this sort of workload.
6. I plan to limit the ZFS ARC to 4GB, as that is as much as I can spare without taking RAM away from the VMs. I know I need more for ideal performance, but the additional RAM will only be available in 12 months' time (see the sketch after this list).
7. We do plan to replace the board and processor with something that will take more RAM at the end of 2016.
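On the ARC cap in point 6: a minimal sketch of how this is usually done with ZFS on Linux, assuming the zfs_arc_max module parameter (the value below is 4GiB in bytes):

Code:
# /etc/modprobe.d/zfs.conf
# cap the ARC at 4 GiB (4 * 1024^3 bytes)
options zfs zfs_arc_max=4294967296

followed by update-initramfs -u and a reboot so the module picks the new limit up.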

PLANNED DISK CONFIGURATION
1. 1x 64GB "consumer" SSD for the Proxmox OS and ISO images only
2. 6x 3TB WD RE4 7200RPM disks in ZFS "RAID10"
- connected to a HighPoint PCIe x8 HBA
3. 2x Samsung SM863 480GB SSDs for L2ARC and ZIL/SLOG
- connected to onboard SATA via AHCI
- each SSD partitioned into 2 ZFS partitions: 10GB and 470GB
- with a 5-second transaction group flush interval and the SSDs capable of a maximum of about 500MB/s of writes, I figured that a 10GB partition is about 4 times larger than needed, but I would rather play it safe
- the 2x 10GB partitions will be mirrored and used for the ZIL/SLOG
- the 2x 470GB partitions will be added as cache devices to provide 940GB of L2ARC. I thought that mirroring the L2ARC would be unnecessary, as the data is checksummed; I would rather have a larger L2ARC.
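For reference, a rough sketch of the zpool commands that layout translates to; the pool name "tank" and the /dev/sdX names are placeholders (in practice use /dev/disk/by-id/ paths):

Code:
# three mirrored vdevs from the six RE4 drives (ZFS "RAID10")
zpool create -o ashift=12 tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf
# mirrored SLOG from the two 10GB SSD partitions
zpool add tank log mirror /dev/sdg1 /dev/sdh1
# the two 470GB partitions as cache devices (L2ARC is striped, never mirrored)
zpool add tank cache /dev/sdg2 /dev/sdh2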

PLAN B on ZFS
I have a whole week to test the configuration while our offices close in December, so if I am not happy with the ZFS setup, I plan to use the 2 SSDs as mirrored storage for some of the VMs, for their OS and database virtual disks. In that case, I would probably stick with MDADM and LVM.
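If it comes to that, plan B would look roughly like this; the device names and volume group name are just examples:

Code:
# mirror the two SSDs with mdadm, then layer LVM on top for the VM disks
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdg /dev/sdh
pvcreate /dev/md1
vgcreate ssd-vg /dev/md1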

QUESTIONS
1. Anything I need to do to the VMs before I take the PVE 3.4 server down? Update the VirtIO drivers on the Windows VMs? Is there a good guide somewhere for how to do it?
2. Anything I will be losing by going from LVM to ZFS from a Proxmox point of view?
3. Does Proxmox 4 use zvols by default, or does it store disk images as files on a ZFS dataset?
4. Should I consider formatting the ZFS block devices with EXT4 or XFS so I can use qcow2 disk images? Some data seems to indicate that the performance is still great and that the flexibility of qcow2 is worth the extra complexity.
5. ZFS literature seems to recommend always enabling LZ4 compression by default. Is there any benefit on a system where disk space is pre-allocated?

REFERENCE MATERIAL
https://blogs.oracle.com/brendan/entry/test
https://clinta.github.io/FreeNAS-Multipurpose-SSD/
https://pthree.org/2013/01/03/zfs-administration-part-xvii-best-practices-and-caveats/

SHOOT AWAY! Any input, experience, or even questions would be appreciated.
 
Short answer: no. You will need RAM.

[...] I plan to limit the ZFS ARC to 4GB, as that is as much as I can spare without taking RAM away from the VMs. [...] the 2x 470GB partitions will be added as cache devices to provide 940GB of L2ARC. [...]

- your very small ARC will be killed by 940GB of L2ARC anyway
- search for the arc_summary script and check your ARC hit ratio. I bet it is very, very small.
- outside the VMs (29GB) you are left with 3GB for: the OS (all kinds of buffers), the ARC (for both data AND metadata), and ZFS buffers (free-space maps being the most important thing for writes). This will simply not work.
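If you want to see it for yourself once a pool is running, the hit ratio can be checked with (depending on the package version the tools may be installed as arc_summary.py / arcstat.py):

Code:
# summary of ARC size, hit ratio and efficiency
arc_summary
# live per-second view of ARC hits and misses
arcstat 1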

QUESTIONS
1. Anything I need to do to the VMs before I take the PVE 3.4 server down? Update the VirtIO drivers on the Windows VMs? Is there a good guide somewhere for how to do it?
I think you can simply insert the VirtIO ISO and use "update driver"?
2. Anything I will be losing by going from LVM to ZFS from a Proxmox point of view?
3. Does Proxmox 4 use zvols by default, or does it store disk images as files on a ZFS dataset?
One zvol for each virtual disk in PVE 4.
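A minimal sketch of the storage definition behind that, with the storage ID and pool name as placeholders; PVE then creates one zvol per virtual disk (names like tank/vm-100-disk-1):

Code:
# /etc/pve/storage.cfg
zfspool: local-zfs
        pool tank
        content images,rootdir

# list the per-disk zvols
zfs list -t volume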
4. Should I consider formatting the ZFS block devices with EXT4 or XFS so I can use qcow2 disk images? Some data seems to indicate that the performance is still great and that the flexibility of qcow2 is worth the extra complexity.
You get great performance and flexibility with zvols (even TRIM/UNMAP with VirtIO-SCSI, which is not easily done with qcow2).
5. ZFS literature seems to recommend always enabling LZ4 compression by default. Is there any benefit on a system where disk space is pre-allocated?
No. It starts to compress writes from the moment you enable it.
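For reference, enabling it pool-wide and checking the effect afterwards ("tank" again being a placeholder pool name):

Code:
# child datasets and zvols inherit the setting
zfs set compression=lz4 tank
# the ratio only reflects data written after enabling it
zfs get compression,compressratio tank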
 
[...] outside the VMs (29GB) you are left with 3GB for: the OS (all kinds of buffers), the ARC (for both data AND metadata), and ZFS buffers. This will simply not work.
I currently have a DB that runs 100% from RAM, so I can free up an additional 6GB that way.
What if I split the SSD partitions 3 ways, to make the RAM requirements more conservative (see the sketch after this list)?
1. 10GB - ZIL/SLOG partition, mirrored - 10GB
2. 40GB - L2ARC - 40GB (80GB combined)
3. ~400GB - mirrored pool of pure SSD storage. I would then store that DB on a virtual disk on this pool.
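A rough sketch of that split, reusing the placeholder device names from above (sdg/sdh for the two SM863s, partitions 1-3 for the 10GB/40GB/~400GB slices):

Code:
# mirrored SLOG from the two 10GB partitions
zpool add tank log mirror /dev/sdg1 /dev/sdh1
# 2x 40GB cache devices = 80GB combined L2ARC
zpool add tank cache /dev/sdg2 /dev/sdh2
# separate all-SSD mirror for the database VM's disk
zpool create ssdpool mirror /dev/sdg3 /dev/sdh3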
 
Yes. If you restore onto a compression-enabled dataset/zvol, it will work. Let's put it this way: if you have compressible data, there is no reason not to enable LZ4 pool-wide from the beginning.
 
LZ4 compression will also increase performance, since writes will be made in chunks using the full disk block size. This should also help improve the lifetime of your disks, since writes will be reduced to a minimum.
 
Hi,

For the L2ARC, be careful about the block size you use, because you need around 400 bytes of memory (RAM) for each block in the L2ARC.

(For example, if you use a 4K block size with 960GB of L2ARC, you need around 96GB of RAM just to handle that.)
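The back-of-the-envelope numbers, taking ~400 bytes per L2ARC header as above, for a 4K block size versus the 128K default recordsize:

Code:
# headers = L2ARC size / block size; RAM = headers * ~400 bytes
echo $(( 960 * 1024**3 / 4096   * 400 / 1024**3 ))  # ~93 GiB for 4K blocks
echo $(( 960 * 1024**3 / 131072 * 400 / 1024**2 ))  # ~3000 MiB for 128K blocks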
 
Enable what modes? There have been multiple topics here.
The primarycache and secondarycache properties control what goes into the ARC and L2ARC (none, metadata, or all = metadata+data).
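For example ("tank/vmdata" is a hypothetical dataset):

Code:
# inspect the current settings
zfs get primarycache,secondarycache tank
# keep only metadata in the L2ARC for one dataset
zfs set secondarycache=metadata tank/vmdata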
 
Some notes (as I am not the person to ask about ZFS in depth):

1. You mentioned you have load averages from 4.0 to 8.0. Since the Xeon E3, AFAIK (you didn't mention which CPU), is a quad-core with Hyper-Threading, that would mean it utilizes your CPU at 50% and sometimes 75-100%. Compare http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

2. You mentioned you want to put a 64GB consumer SSD in there. Consider (if you have the ports) putting 2x consumer SSDs in there and running them as RAID 1 or mirrored vdevs (ZFS RAID1), as those consumer SSDs have a tendency to die without any warning. I personally do ZFS RAID1 on my machines (70+ nodes running that way with small consumer SSDs).
 
[...] Consider (if you have the ports) putting 2x consumer SSDs in there and running them as RAID 1 or mirrored vdevs (ZFS RAID1), as those consumer SSDs have a tendency to die without any warning. [...]

We are careful about which SSD brands we use. We stick with Samsung, Intel or Kingston usually, and never buy the entry-level product. We have never had one fail without advance notice via SMART. Of the +-50 SSDs we have, only an Intel 530 unit has failed, and it gave us ample warning via SMART.

I will need to have a look at ZFS RAID1 then, because hardware RAID and Linux MDADM stability have been very poor for us. I think hardware RAID was poor because the controllers pre-date SSDs, and MDADM is poor in general when used for boot drives.

Our experience has been that we have had more downtime caused by an unbootable RAID than by any disk or SSD failure. Our strategy, should an OS SSD fail, is to replace the SSD with one of our cold spares, load a Clonezilla ISO via IPMI, restore a known-good image backup and restore the config files via script. We have recovery down to 8 minutes from the time we replace the SSD.
 
[...] We have never had one fail without advance notice via SMART. Of the +-50 [...]
We have some 3k+ SSDs (ADATA / Kingston / SanDisk / Samsung / Transcend). Failure rates (without SMART indications) are about 2% after a year and around 8% after 2 years of 24/7 operation. It doesn't really matter which vendor; the differences are marginal. Where it makes a big difference is in detecting about-to-fail drives via SMART.

[...] load a Clonezilla ISO via IPMI, restore a known-good image backup and restore the config files via script. We have recovery down to 8 minutes from the time we replace the SSD.

Sounds interesting, and actually more economical (in all respects). I've read over some of the guides for Clonezilla Server Edition so far. 8 minutes is fast enough.

A bit OT: can you install it in a VM (on a secondary Proxmox cluster), then use a VLAN-tagged network connected to the primary Proxmox cluster nodes and back up your OS SSDs - let's say once every hour (keep 24 copies) and once per day (keep 30 copies) - for, let's say, 60+ nodes? (I am aware of the bandwidth/space constraints.) Automatically?
 
[...] Can you install it in a VM (on a secondary Proxmox cluster), then use a VLAN-tagged network connected to the primary Proxmox cluster nodes and back up your OS SSDs [...] automatically?
It's Ubuntu-kernel based, so I can't see a reason why it couldn't run in a VM. The issue is that, as far as I know, Clonezilla can't live-backup an OS yet. We actually keep an SSD ready with a "base install" of Proxmox on it. We have all the config files backing up on each host via rsync. Once we get the new SSD in, we simply drop the config files in place, reboot and add the host to the cluster.
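For illustration, a minimal sketch of that kind of per-host config backup via rsync; the paths and the backup target are assumptions, not the actual script:

Code:
# push key config files to a backup host, preserving their paths
rsync -a --relative /etc/pve /etc/network/interfaces /etc/modprobe.d \
      backup@backuphost:/backups/$(hostname)/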
 
