ZFS SIMD patch in 0.8.1-pve2 causes FPU corruption

0xFelix

Member
Oct 25, 2017
Hello,

PVE 6 currently ships the faulty ZFS SIMD patch for 5.0+ kernels which is known to cause FPU corruption.

See this issue: https://github.com/zfsonlinux/zfs/issues/9346

It was cherry-picked here: https://git.proxmox.com/?p=zfsonlinux.git;a=commit;h=f43dbfa75207ffa8be7aa8f969f77f9e5a7a582a

This is a really nasty bug; the patch should be reverted immediately until a final solution is found.

I noticed this on my PVE hosts while sending snapshots and generating I/O load on the sending ZFS pool at the same time.
As a result, the sent ZFS streams were corrupted.

Currently it can be mitigated by turning off SIMD acceleration:

Code:
echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
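To confirm that the mitigation took effect, the two parameters can be read back. This is only a minimal sketch: it assumes the usual ZFS behaviour of listing all available implementations with the active one in square brackets, and the helper name `selected_impl` is made up for illustration.

```shell
#!/bin/sh
# Print the currently selected implementation for both tunables.
# Assumption: each parameter file lists the available implementations and
# marks the active one in square brackets, roughly like
# "[fastest] original scalar sse2".

selected_impl() {
    # Hypothetical helper: read a parameter listing on stdin and print
    # only the bracketed (active) entry.
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

for param in \
    /sys/module/zfs/parameters/zfs_vdev_raidz_impl \
    /sys/module/zcommon/parameters/zfs_fletcher_4_impl
do
    if [ -r "$param" ]; then
        echo "$param: $(selected_impl < "$param")"
    else
        echo "$param: not present (ZFS module not loaded?)"
    fi
done
```

After the two `echo scalar` commands, both lines should report `scalar`. Values written through /sys are runtime settings, so they have to be reapplied after a reboot or a module reload.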
 

0xFelix

Well... it seems a kernel update to pve-kernel-5.0.21-2-pve / 5.0.21-6 fixes the issue for now, but this kernel was only released yesterday or on Monday? A host I freshly installed on Sunday still got the faulty 5.0.21-3. So this bug was out in the wild for several weeks? Could this have led to silent data corruption?
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
the fix was packaged on Friday, and released on the public repos on Monday. bugs take time to analyze and fix (and verify), unfortunately.

this issue in particular will in most cases lead to visible problems (like the ones originally reported). it has the potential for silent corruption as well, if something in userland uses the FPU but does not or cannot verify the result of the operation. data stored on ZFS itself is unaffected - only operations done by applications in userspace are affected.
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
South Tyrol/Italy
shop.maurer-it.com
0xFelix said: "PVE 6 currently ships the faulty ZFS SIMD patch for 5.0+ kernels which is known to cause FPU corruption."
Some clarification: this specific bug in the ZFS SIMD patch from the ZoL master branch only triggers in combination with 5.0- and 5.1-based kernels. Older and newer kernels are not affected, as they handle the FPU differently.

It seems that the bug got introduced in pve-kernel with kernel ABI package "pve-kernel-5.0.21-1-pve" in version 5.0.21-1 and the first kernel including a fix for this is "pve-kernel-5.0.21-2-pve" in version "5.0.21-6".
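To check a host against the versions named above, something along these lines could be used. It is a rough sketch, not an official check: the function names are made up, and it assumes that every package version from 5.0.21-1 up to (but not including) 5.0.21-6 carries the faulty patch, which is only what this thread suggests.

```shell
#!/bin/sh
# Hypothetical helpers; the affected range is taken from this thread only.

version_lt() {
    # Succeeds if version $1 sorts strictly before version $2.
    if command -v dpkg >/dev/null 2>&1; then
        dpkg --compare-versions "$1" lt "$2"
    else
        # Fallback for systems without dpkg (needs a sort with -V support).
        [ "$1" != "$2" ] &&
            [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
    fi
}

kernel_pkg_affected() {
    # Assumed affected range: 5.0.21-1 <= version < 5.0.21-6
    ! version_lt "$1" 5.0.21-1 && version_lt "$1" 5.0.21-6
}

# Usage on a PVE host (package name derived from the running kernel):
#   ver=$(dpkg-query -W -f='${Version}' "pve-kernel-$(uname -r)")
#   if kernel_pkg_affected "$ver"; then echo "affected: $ver"; fi
```

Note that `uname -r` only shows the ABI name (for example 5.0.21-1-pve), not the package version, which is why the package version is queried separately.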

The problematic FPU methods are mainly used for checksumming in RAID-Z based modes and during scrubs.
There was no possibility of negative effects for ZFS itself. What could be affected were other userspace processes that had SIMD/FPU operations in flight when a context switch into the kernel ZFS module happened while that module was also using the FPU.
So any operation running during a ZFS scrub (on any ZFS setup), or in general while RAID-Z modes were in use, had a chance of being hit.

In practice, once we were aware of the issue and had a reproducer, I could only trigger such errors by running a "zpool scrub" while "stress-ng" floating-point error-checking tests were running. I could never trigger them through RAID-Z I/O alone (heavy write, read, delete, repeat cycles), so there's that.
We'll evaluate this a bit more and see if we can find more specific limitations where this could lead to any corruption of FPU-dependent data written by other processes.
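For anyone wanting to try the scrub plus stress-ng combination on a test machine, a sketch could look like the following. The exact stress-ng options used for the reproducer are not given in this thread, so the ones below (all CPU methods with result verification) are an assumption, as are the function name `fpu_repro` and the pool name. By default the function only prints the commands.

```shell
#!/bin/sh
# Hypothetical reproducer sketch; NOT the exact test described in this thread.
# Assumed stress-ng options: --cpu 0 (one worker per online CPU),
# --cpu-method all, --verify (check computed results), --timeout.

fpu_repro() {
    pool=$1
    duration=${2:-5m}

    # Dry run by default: print the commands instead of executing them.
    if [ "${DRY_RUN:-1}" = 1 ]; then run=echo; else run=; fi

    # Start a scrub; zpool returns immediately and the scrub continues
    # in the background inside the kernel.
    $run zpool scrub "$pool"

    # Run floating-point heavy work with result verification meanwhile.
    $run stress-ng --cpu 0 --cpu-method all --verify --timeout "$duration"
    status=$?

    $run zpool status "$pool"
    return "$status"
}

# Print the commands for a pool called "tank" (example name):
fpu_repro tank 5m
```

Running it with `DRY_RUN=0` executes the commands for real. On an affected kernel, verification failures would be expected to show up in the stress-ng output; this should only be done on a machine where a corrupted userspace computation does no harm.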
 

0xFelix

Thank you for the clarification!

I was worried about data corruption on my affected hosts, but since they were only storing data and not running any other applications at the time, they should probably be fine.

I could observe the bug with "zfs send". A simple "dd if=/dev/null of=testfile bs=1M" on the sending system / pool was enough to corrupt the checksums of the sent ZFS streams. After installing "pve-kernel-5.0.21-2-pve" in version "5.0.21-6" the bug is no longer reproducible.
 
