The question is: Is it safe or even recommended to use EXT4 as the storage file system (in my case local SATA storage, RAID-10) and/or install Proxmox with the boot parameter "linux ext4"?
We have been installing Proxmox with linux=ext4 for several years now on all nodes in our cluster, on hardware RAID10 (without battery protection). Ext4 has worked very reliably and has survived many kernel panics and unplanned cold resets without any data corruption.
It also provides several advantages over ext3 that we observed:
- it is considerably faster during normal operation, especially on slower disk subsystems (single disk, mirror)
http://www.phoronix.com/scan.php?page=article&item=ext4_benchmarks&num=3
- it is orders of magnitude faster when doing recovery or filesystem-check (fsck) which can mean only minutes of downtime (instead of hours) after an unclean restart
http://en.wikipedia.org/wiki/File:E2fsck-uninit.svg
- it is much faster with large numbers of files and directories (speeding up vzdump / vzmigrate in our tests)
- much less fragmentation by design + online defragmentation
- larger maximum filesystem and file sizes
My second question is about the mount options for local ext4 storage on a SATA RAID-10 array with BBU (KVM images only): I noticed that it's possible to gain a large performance boost from using "defaults,noatime,nodiratime,data=writeback,nobarrier,nobh,commit=10,nodelalloc" with the setup I mentioned ("nodelalloc" in particular increased the fsyncs/s a lot, and in addition I use "blockdev --setra 512 /dev/mapper/pve-data"). I came up with these settings after a while of research and I'm almost certain that they would work well on a Proxmox node, but I'm not absolutely sure how safe and stable they are in a production environment. Would anyone be willing to try these settings in a similar testing environment or elaborate on any issues that might occur with these mount options?
We have also tested most of these mount options, but since the journal itself is the biggest factor in the reliability of ext4, we decided against using them and ended up with only noatime and nodiratime. For testing we used bonnie++, because the fsyncs/s number reported by pveperf is far from a comprehensive I/O benchmark.
data=writeback
According to the documentation, with data=writeback only metadata is journaled and file contents may reach the disk after the metadata that describes them. After a crash, a file can therefore look correctly written (its metadata is intact) while containing obsolete data, which would be unacceptable for us:
http://unix.stackexchange.com/quest...journal-vs-data-writeback-in-ext4-file-system
nobarrier
Nobarrier is also dangerous, especially if your RAID controller has no BBU and/or your server has no UPS:
http://www.phoronix.com/scan.php?page=article&item=ext4_linux35_tuning&num=1
nodelalloc
On the other hand, nodelalloc may be useful for data integrity. It reportedly increases performance on many-disk arrays but decreases it on single disks and mirrors; we never tested it ourselves:
http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/
http://www.phoronix.com/scan.php?page=article&item=ext4_linux35_tuning&num=2
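If you want to check which of your nodes already run with the integrity-relevant options discussed above, something like the following could help. It is only a minimal sketch: the function name and the list of flagged options are my own choices, the sample line is made up, and how the kernel spells these options in /proc/mounts can vary between versions (hence both "nobarrier" and "barrier=0").

```python
# Options this thread treats as trading integrity for speed.
# Illustrative list only, not an official classification.
RISKY = {
    "data=writeback": "file contents may be stale after a crash",
    "nobarrier": "write barriers disabled",
    "barrier=0": "write barriers disabled",
}


def risky_ext4_mounts(mounts_text):
    """Return {mount_point: [risky options]} for ext4 lines in /proc/mounts-style text."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        hits = [opt for opt in fields[3].split(",") if opt in RISKY]
        if hits:
            found[fields[1]] = hits
    return found


# Made-up sample in /proc/mounts format; on a real node, pass in the
# contents of /proc/mounts instead.
sample = (
    "/dev/mapper/pve-root / ext4 rw,noatime,nodiratime 0 0\n"
    "/dev/mapper/pve-data /var/lib/vz ext4 rw,noatime,data=writeback,nobarrier 0 0\n"
)
print(risky_ext4_mounts(sample))
# {'/var/lib/vz': ['data=writeback', 'nobarrier']}
```

noatime and nodiratime are deliberately not flagged, since they do not touch journaling.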
read-ahead
Read-ahead should be set carefully, as it depends very much on your RAID stripe size and on your I/O profile (how sequential your average I/O operation is, etc.), so it is very useful to run bonnie++ with many candidate settings before settling on a value. We didn't find a big effect on a 6-disk hardware RAID10 compared to the default, but on mdraid it may prove useful.
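To relate a --setra value to the stripe geometry, keep in mind that blockdev counts read-ahead in 512-byte sectors, so "--setra 512" means 256 KiB. The sketch below does that arithmetic and derives a few candidate values to benchmark. The 256 KiB per-disk stripe size is purely an assumption for illustration (check your controller's setting), the helper names are mine, and the candidates are starting points for bonnie++ runs, not recommendations.

```python
SECTOR_BYTES = 512  # blockdev --setra / --getra count in 512-byte sectors


def setra_to_kib(sectors):
    """Convert a blockdev --setra value to KiB."""
    return sectors * SECTOR_BYTES // 1024


def full_stripe_sectors(per_disk_stripe_kib, data_disks):
    """Size of one full stripe across all data-bearing disks, in 512-byte sectors."""
    return per_disk_stripe_kib * 1024 * data_disks // SECTOR_BYTES


print(setra_to_kib(512))  # 256 (KiB), the value from the question

# Assumption: 6-disk RAID10 = 3 mirrored pairs, so 3 data-bearing members
# per stripe, with a hypothetical 256 KiB per-disk stripe size.
stripe = full_stripe_sectors(256, 3)
print(stripe)  # 1536 sectors = 768 KiB

# Candidate --setra values to compare in benchmarks: half, one, two full stripes.
candidates = [stripe // 2, stripe, stripe * 2]
print(candidates)  # [768, 1536, 3072]
```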
mdraid vs. hardware RAID
Not long ago the Proxmox kernel wasn't as stable as it is now, and a kernel panic with a subsequent cold reset on a software RAID system did cause data corruption and even a RAID collapse for us, even though our servers were UPS protected. Therefore, if data integrity is important to you, you SHOULD use a hardware RAID controller that doesn't go down even when your kernel does. Together with a journaling filesystem and a UPS, it will provide data integrity even if your server crashes 5 times a day, because the controller and the disks will always complete the writes that they received from the OS, and the filesystem will preserve atomicity when it replays the journal during recovery. That's why a BBU is not really necessary if journaling is functioning properly.
Summary
To summarize: you should use ext4. It's safe and reliable, but messing with the default journaling options may be bad for data integrity, especially on software RAID or single disks. We chose data integrity over maximum performance, but on an affordable Adaptec 6805E controller with a 6-disk RAID10, pveperf gives us 700 MB/s sequential and 3000 fsyncs/s random with only the noatime mount option, so no complaints here.