Filesystem and SSD optimizations for the next version of Proxmox

gkovacs

After 1.5 years of running a small Proxmox cluster (started even before 1.0), we have identified a couple of problems that we would like to see solved in the next version, as we plan to move some of our systems to SSD-based storage (but would also like the HDD-based Proxmox nodes to become faster). It's important to note that the solutions identified below apply to (and enhance the performance of) both HDD- and SSD-based setups, as well as hardware RAIDs of any kind of storage.

1. Ext4 / Btrfs filesystem
According to our analysis, the current Ext3 filesystem is sufficient for general use, but very slow in maintenance tasks like fsck and snapshot backups of a huge number of small files. Ext4 and Btrfs appear to have mitigated many of these problems: they handle small files more efficiently, and maintenance like fsck takes only a fraction of the time on them. For a long time this would not have been possible due to the kernel requirements of OpenVZ, but that wall is about to break down with the imminent release of the 2.6.32 OpenVZ kernel, which supports both Ext4 and Btrfs.

We propose:
- a change of default filesystem to Ext4, and
- an option at Proxmox install to select from available filesystems
(even Btrfs as it's approaching stable status)
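
For anyone who wants to experiment before this lands in the installer, a rough sketch of reformatting the data volume as Ext4 on a test machine could look like the following (the /dev/pve/data device and /var/lib/vz mount point are the defaults from our install, adjust as needed, and note that this destroys everything on that volume):

# umount /var/lib/vz
# mkfs.ext4 /dev/pve/data
# mount -t ext4 /dev/pve/data /var/lib/vz
(and change ext3 to ext4 in the corresponding /etc/fstab entry)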

2. Noatime mount option
It's a widely held belief in the Linux community that the biggest performance burden on disk IO is the access time flag, which effectively causes a small write for every file read operation. There is a solution to that problem: mounting the filesystem with the noatime or relatime flag. Theodore Ts'o (the developer of Ext4) advises against using relatime, as only noatime provides the full performance benefit that we need when running multiple VMs on the same filesystem, be it on an SSD or HDD:
http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime/

We propose an install option for setting noatime on all the Proxmox filesystems.
We also need clarification on noatime in both OpenVZ and KVM environments, as it's not clear whether it needs to be set in the guest filesystems as well.
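
Until there is an installer option, a minimal sketch of doing it by hand (device and mount point are just the defaults from our install) is adding noatime to the options field of the relevant /etc/fstab entries and remounting:

/dev/pve/data /var/lib/vz ext3 defaults,noatime 0 1
# mount -o remount,noatime /var/lib/vz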

3. Partition alignment
The world of 512-byte sectors is coming to an end: HDD manufacturers have agreed on the "Advanced Format" specification, which is already available in new hard drives sporting large 4K sectors. We already know that SSDs benefit tremendously from larger sectors, as flash memory can only be erased in blocks of 128-512K. All these advances in IO performance are only available if partition boundaries are aligned to at least 4K, but preferably to 128-512K, as the biggest performance gains are expected when the partitions are aligned to both the RAID stripe size and the SSD erase block size. Unfortunately the default Linux partitioning scheme does not support alignment out of the box, so steps need to be taken:
http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/

We propose a modification of the Proxmox installer's partitioning so that it uses both:
- # fdisk -H 224 -S 56 /dev/sdX (or some other, even larger aligned H/S geometry)
- # pvcreate --metadatasize 250k /dev/sdX2
as explained in Theodore's post above (he even provides a way to test the alignment).
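
For reference, a quick sanity check after installing (assuming the layout above, with the LVM PV on /dev/sdX2) is to print the partition table in sectors and the physical extent start of the PV, and verify that both fall on the desired boundary:

# fdisk -lu /dev/sdX
# pvs -o +pe_start /dev/sdX2
(the partition start sector multiplied by 512, and pe_start, should both be multiples of the erase block / stripe size)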
 
hi
your suggestions look sound.

question regarding noatime - do kvm-only hosts benefit from this also, or is it mainly a winner for openvz hosting systems?

regards
hk
 
One more SSD/filesystem optimization came up since I posted this, and it's well worth adding:

4. Default I/O scheduler
For some reason, the default 'cfq' IO scheduler in Linux 2.6.24 is not optimal for the workload Proxmox presents (we have a mixed OpenVZ/KVM environment). If there is any disk activity, background tasks (like VZDump, but even a simple tar) slow to a crawl. We have also observed that during the nightly VZDump backups, our Apache and MySQL servers often timed out. We have solved these problems by changing the default I/O scheduler to 'deadline' on our single-disk and RAID-based systems, but on an SSD the optimal scheduler would be 'noop'. So we propose:

- changing the default I/O scheduler to 'deadline' in the Proxmox installer
- presenting an option to change it to 'noop' or any of the others, should the user wish

Of course further testing should be done on 2.6.32 regarding which scheduler works best on different storage subsystems. Most sources agree that 'noop' is the best for SSDs and for RAID controllers that present themselves as a single device, and 'deadline' is also a good choice because it tries to limit latency for every IO request.
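
Until the installer does this, a minimal sketch for trying it out (sda is just an example device): the scheduler can be switched per device at runtime through sysfs, and made permanent by appending elevator=deadline to the kernel line in the boot loader config (/boot/grub/menu.lst on our systems):

# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler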

Further info can be found here:
http://forum.proxmox.com/threads/434-I-O-scheduler
http://tombuntu.com/index.php/2008/09/04/four-tweaks-for-using-linux-with-solid-state-drives/
http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID
http://wiki.hackalope.com/doku.php?id=linux_for_ssd
 
I use elevator=deadline at boot time, and I don't see any problems with my OpenVZ containers.
I don't use and don't know how to use the IO priorities you talk about...
 
