Hi,
We seem to be having some regular and predictable issues with ZFS on our Proxmox hosts, especially under load.
Although the following are two separate scenarios, we believe that in both cases the zpool is under heavy load, which triggers the same negative outcome.
In summary, we run a cluster of PVE hosts with the following package versions:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.3.13-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 12.2.13-pve1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
1. In the first scenario, we have had to explicitly disable smartd on ALL of our PVE hosts, because almost every time the following 'error' occurs when smartd runs a device check (every 30 minutes by default), VMs hang and lose their boot sectors, which then have to be repaired manually:
Code:
This message was generated by the smartd daemon running on:
host name: pve-10
DNS domain: xxxxxxxxx
The following warning/error was logged by the smartd daemon:
Device: /dev/sde [SAT], failed to read SMART Attribute Data
Device info:
SAMSUNG MZ7KH1T9HAJR-00005, S/N:S47PNA0N206256, WWN:5-002538-e00298998, FW:HXM7404Q, 1.92 TB
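For reference, "explicitly disabling smartd" above amounts to the following. This is only a minimal sketch: the smartd unit name and the Debian-style /etc/default/smartmontools file are assumptions that may differ on your install, and lengthening the poll interval is the less drastic alternative to disabling it outright:
Code:
# Our current workaround: stop and disable the daemon entirely.
systemctl disable --now smartd

# Less drastic: poll every 4 hours instead of the 30-minute default.
# In /etc/default/smartmontools set:
#   smartd_opts="--interval=14400"
systemctl restart smartd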
After such an event, the zpool shows the following errors but otherwise seems healthy:
Code:
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 1 days 03:15:10 with 0 errors on Mon May 11 03:39:11 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda3    ONLINE       1     0     0
            sdk3    ONLINE       0     0     0
            sdc3    ONLINE       0     0     0
            sdd3    ONLINE       0     0     0
            sde3    ONLINE       1     7     0
            sdf3    ONLINE       0     0     0
            sdg3    ONLINE       0     0     0
            sdh3    ONLINE       0     0     0
Subsequent scrubs complete fine, with no errors (as in the case above).
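For reference, after each such event we clear the counters and re-verify by hand, roughly as follows (standard zpool commands, pool name as above):
Code:
zpool clear rpool         # reset the READ/WRITE/CKSUM error counters
zpool scrub rpool         # re-verify all data on the pool
zpool status -v rpool     # confirm the scrub finishes with 0 errors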
In the logs, every time such an event happens, we see (similar to) the following:
Code:
Apr 20 23:19:28 pve-04 smartd[5514]: Device: /dev/sda [SAT], failed to read SMART Attribute Data
Apr 20 23:19:28 pve-04 smartd[5514]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Apr 20 23:19:28 pve-04 pvestatd[2827236]: got timeout
Apr 20 23:19:29 pve-04 zed: eid=20810 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 smartd[5514]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Apr 20 23:19:29 pve-04 zed: eid=20811 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 zed: eid=20812 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 zed: eid=20813 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 pvestatd[2827236]: status update time (28.105 seconds)
Apr 20 23:19:30 pve-04 zed: eid=20814 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 pvemailforward[1538369]: forward mail to <admin>
Apr 20 23:19:30 pve-04 zed: eid=20815 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 zed: eid=20816 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 zed: eid=20817 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 pve-firewall[10752]: firewall update time (26.321 seconds)
Apr 20 23:19:30 pve-04 zed: eid=20818 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:31 pve-04 zed: eid=20819 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:31 pve-04 zed: eid=20820 class=io pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
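The 'class=delay' events from zed above are slow-I/O reports. If anyone wants to correlate these with load, the reporting threshold can be inspected via the zio_slow_io_ms module parameter (present on our zfs 0.8.3 install); the value below is illustrative only, and raising it merely hides the symptom:
Code:
# Threshold (ms) above which an I/O is reported as a delay event:
cat /sys/module/zfs/parameters/zio_slow_io_ms

# Illustrative only: raise the threshold to 60s while diagnosing.
echo 60000 > /sys/module/zfs/parameters/zio_slow_io_ms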
All manual SMART checks on the affected devices always come back normal.
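Concretely, the manual checks we run look like this (device name taken from the example above):
Code:
smartctl -H /dev/sde      # overall health self-assessment - always PASSED
smartctl -a /dev/sde      # full attribute dump - nothing abnormal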
2. In the second scenario, we have seen the SAME thing occur (i.e. VM boot-sector corruption) when zpools are under significant load.
This typically happens when we migrate VMs from one PVE host to another via an API call, which does NOT seem to respect the bandwidth limits set for the cluster - but that's a topic for another post.
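For what it's worth, we do have cluster-wide limits configured, roughly as below (values illustrative, VMID hypothetical); it is the API-triggered migration that appears to ignore the datacenter.cfg setting:
Code:
# /etc/pve/datacenter.cfg - cluster-wide defaults, values in KiB/s:
bwlimit: migrate=51200

# Per-job override from the CLI (also KiB/s):
qm migrate 100 pve-04 --online --bwlimit 51200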
We believe that this is possibly all related to the following upstream issue:
https://github.com/openzfs/zfs/issues/10095
Any comments/similar observations?
Kind regards,
Angelo.