After a year and a half of running a small Proxmox cluster (started even before 1.0), we have identified a couple of problems that we would like to see solved in the next version, as we plan to move some of our systems to SSD-based storage (but would also like the HDD-based Proxmox nodes to become faster). It is important to note that the solutions below apply to (and enhance the performance of) both HDD- and SSD-based setups, as well as hardware RAIDs on any kind of storage.
1. Ext4 / Btrfs filesystem
According to our analysis the current Ext3 filesystem is sufficient for general use, but very slow in maintenance tasks like fsck and snapshot backups of a huge number of small files. Ext4 and Btrfs mitigate many of these problems: they handle small files more efficiently, and maintenance tasks like fsck take only a fraction of the time. For a long time this would not have been possible due to the kernel requirements of OpenVZ, but that wall is about to break down with the imminent release of the 2.6.32 OpenVZ kernel, which supports both Ext4 and Btrfs.
We propose:
- a change of default filesystem to Ext4, and
- an option in the Proxmox installer to select from the available filesystems (even Btrfs, as it is approaching stable status)
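As a rough sketch of what such an install option could look like internally, here is a hypothetical helper that maps the selected filesystem to its mkfs command (the function and device names are illustrative only, not actual Proxmox installer code):

```shell
#!/bin/sh
# Hypothetical sketch: map the filesystem chosen at install time to the
# corresponding mkfs command. Names are illustrative, not Proxmox code.
pick_mkfs() {
    case "$1" in
        ext3)  echo "mkfs.ext3" ;;
        ext4)  echo "mkfs.ext4" ;;      # proposed new default
        btrfs) echo "mkfs.btrfs" ;;     # offered once it stabilizes
        *)     echo "unsupported filesystem: $1" >&2; return 1 ;;
    esac
}

# Example use during install (device path illustrative):
#   $(pick_mkfs ext4) /dev/pve/data
```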
2. Noatime mount option
It is a widely held belief in the Linux community that one of the biggest performance burdens on disk IO is the access-time (atime) flag, which effectively causes a small write for every file read operation. The solution is to mount the filesystem with the noatime or relatime flag. Theodore Ts'o (the ext4 maintainer) advises using noatime rather than relatime, as only noatime provides the full performance benefit we need when running multiple VMs on the same filesystem, be it on an SSD or HDD:
http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime/
We propose an install option for setting noatime for all the Proxmox filesystems.
We also need clarification on how noatime behaves in OpenVZ and KVM environments, as it is not clear whether it needs to be set in the guest filesystems as well.
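To illustrate what the install option would have to do, here is a minimal sketch that appends noatime to the mount-options field (the fourth column) of an /etc/fstab line. The helper name is made up; only the fstab field layout is standard:

```shell
#!/bin/sh
# Sketch: append "noatime" to the mount-options field (4th column) of a
# single fstab line. Illustrative helper, not existing Proxmox code.
add_noatime() {
    echo "$1" | awk '{ $4 = $4 ",noatime"; print }'
}

# Example fstab line before and after (paths illustrative):
#   /dev/pve/data /var/lib/vz ext3 defaults 0 1
#   /dev/pve/data /var/lib/vz ext3 defaults,noatime 0 1
```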
3. Partition alignment
The world of 512-byte sectors is coming to an end: HDD manufacturers have agreed on the "Advanced Format" specification, and new hard drives with 4K sectors are already shipping. We also know that SSDs benefit tremendously from proper alignment, as flash memory is erasable only in blocks of 128-512 KB. These IO performance gains are only available if partition boundaries are aligned to at least 4K, but preferably to 128-512 KB; the biggest gains are expected when partitions are aligned to both the RAID stripe size and the SSD erase block size. Unfortunately the default Linux partitioning scheme does not support alignment out of the box, so steps need to be taken:
http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/
We propose a modification of the Proxmox installer's partitioning so it uses both:
- # fdisk -H 224 -S 56 /dev/sdX (or other, even larger aligned H/S values)
- # pvcreate --metadatasize 250k /dev/sdX2
as explained in Ts'o's post above (he even provides a way to test the alignment).
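The alignment itself is easy to check arithmetically: a partition start, reported in 512-byte sectors by fdisk, is aligned if its byte offset is a multiple of the target block size. A small sketch of such a check (illustrative, not a quote from Ts'o's post):

```shell
#!/bin/sh
# Sketch: check whether a partition's starting sector (in 512-byte units,
# as printed by "fdisk -lu") lies on a boundary of ALIGN bytes.
# Illustrative helper, not an existing tool.
align_ok() {
    start_sector=$1
    align_bytes=$2
    if [ $(( start_sector * 512 % align_bytes )) -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

# Example: the classic DOS offset of sector 63 is NOT even 4K-aligned,
# while a start at sector 2048 (1 MiB) satisfies a 512K erase block:
#   align_ok 63 4096
#   align_ok 2048 524288
```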