@gkovacs
I read the zfsonlinux thread you mention. Did you have the Linux swap file on a ZVOL when you had those data corruptions? Did you have any KVM machines running with direct hardware access?
Yes, I did have swap on ZFS (not sure if it was a ZVOL; it was the default Proxmox ZFS RAID10 install). I did not have ANY virtual machines running, I only restored containers and VMs to test for data corruption. Also, swap was rarely (if ever) used, since no guests were running.
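For what it's worth, something like this should show whether the swap device actually sits on a ZVOL (just a sketch; the `rpool/swap` name is what I'd expect from a default Proxmox install):

```
# list the active swap devices (a ZVOL shows up as /dev/zdN here)
swapon -s            # or: cat /proc/swaps

# list all ZVOLs in the pool
zfs list -t volume

# on a default Proxmox ZFS install the swap ZVOL should be linked as
# /dev/zvol/rpool/swap, pointing at one of the /dev/zd* devices above
ls -l /dev/zvol/rpool/swap
```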
Also, I came across this guy doing some heavy testing of ZoL on a large NUMA system (http://blog.servercentral.com/zfs-thangs) and contributing improvements to ZoL. Maybe he could test for data corruption too.
Regarding the memory bank issue: that is interesting. I think that with four DIMM sockets occupied the system may split the memory into two zones bound to specific cores of the CPU, whereas with just two DIMM sockets occupied all cores have to contend for the same memory in a single SMP zone. This should be visible in `numactl --hardware`.
I have checked numactl with all the different memory configurations, but I only ever had a single zone (zone 0), never more. Since it's a single-socket system, that's what I was expecting.
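For reference, the checks were roughly along these lines (exact output obviously differs per box):

```
# show NUMA nodes with their CPUs and memory sizes;
# a single-socket board normally reports only node 0
numactl --hardware

# quick cross-check of the node count
lscpu | grep -i numa
```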
This might also affect DMA mappings, but I do not understand why that should affect LVM differently than ZFS. Could it be that ZFS uses more RAM than LVM, thereby thrashing some DMA area under memory pressure? This is pure speculation; maybe on high-RAM systems that never come close to swapping this simply isn't visible. Or perhaps newer kernel features like transparent hugepages, and the compaction/migration of pages triggered by the ARC's RAM usage, are involved. As the swap-on-ZVOL issues demonstrate, I think there is some negative impact on stability from either the newer ZFS on Linux code, the kernel, or both.
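If someone wants to test the memory-pressure / THP theory, I think the obvious knobs are capping the ARC and switching off THP, roughly like this (a sketch only; the 4 GiB value is just an example):

```
# cap the ARC so ZFS cannot consume most of the RAM (example: 4 GiB)
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
# the current value can also be changed at runtime:
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# disable transparent hugepages and the compaction/defrag work they trigger
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```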
We also experienced some very rare data corruption on ext4 / LVM, but only under MySQL (some InnoDB indexes got corrupted). However, MySQL has already demonstrated weird issues under LXC, so I'm not sure it's a related problem.
Maybe it's connected to the checksum computation somehow? Both ZFS and MySQL/InnoDB compute checksums on their disk writes... then again, ext4 does as well on its journal, and we never had problems there.
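If the checksum angle is worth pursuing, the layers could at least be checked independently, something like this (a sketch; the table and device names are just examples, and innochecksum needs MySQL stopped or a copy of the files):

```
# have ZFS verify every block checksum in the pool
zpool scrub rpool
zpool status -v rpool      # lists CKSUM errors and affected files, if any

# verify InnoDB page checksums on an offline copy of a tablespace
innochecksum /var/lib/mysql/somedb/sometable.ibd

# see which ext4 features (journal checksumming etc.) are enabled
dumpe2fs -h /dev/pve/root | grep -i features
```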
Anyhow, I would really love to test the same version of ZFS that's included in Proxmox 4.1 under Proxmox 3.4: if it works fine there, then it must be a 4.2 kernel issue (that would be my current bet). I hope that @dietmar and @tom can sympathize with our intention of finding this bug, and help us out by creating a ZFS 0.6.5.3 package for PVE 3.4.
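In case it helps, here is a rough sketch of how I imagine 0.6.5.3 could be built from source on a PVE 3.4 box (untested on that kernel, so treat every step, package name and tag as an assumption):

```
# build dependencies (rough list)
apt-get install build-essential autoconf libtool gawk alien fakeroot \
    uuid-dev libblkid-dev libattr1-dev zlib1g-dev pve-headers-$(uname -r)

# SPL first, then ZFS, both at the 0.6.5.3 tag
git clone https://github.com/zfsonlinux/spl && cd spl
git checkout spl-0.6.5.3
./autogen.sh && ./configure && make -j"$(nproc)" deb
dpkg -i *.deb && cd ..

git clone https://github.com/zfsonlinux/zfs && cd zfs
git checkout zfs-0.6.5.3
./autogen.sh && ./configure && make -j"$(nproc)" deb
dpkg -i *.deb
```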