Hi,
We seem to be having some regular and predictable issues with ZFS on our Proxmox hosts, especially under load.
Although the following are two separate scenarios, we believe that in both cases the zpool is under heavy load, which triggers the same negative outcome.
In summary, we run a cluster of PVE hosts with the following package versions:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.3.13-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 12.2.13-pve1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
1. In the first scenario, we have had to explicitly disable smartd on ALL of our PVE hosts, because almost every time the following 'error' occurs when smartd runs a device check (every 30 minutes by default), VMs hang and lose their boot sectors, which then have to be repaired manually:
Code:
This message was generated by the smartd daemon running on:
host name: pve-10
DNS domain: xxxxxxxxx
The following warning/error was logged by the smartd daemon:
Device: /dev/sde [SAT], failed to read SMART Attribute Data
Device info:
SAMSUNG MZ7KH1T9HAJR-00005, S/N:S47PNA0N206256, WWN:5-002538-e00298998, FW:HXM7404Q, 1.92 TB
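For reference, "explicitly disabling smartd" above amounts to the following. This is only a minimal sketch: the smartd unit name and the Debian-style /etc/default/smartmontools file are assumptions that may differ on your install, and lengthening the poll interval is the less drastic alternative to disabling it outright:
Code:
# Our current workaround: stop and disable the daemon entirely.
systemctl disable --now smartd

# Less drastic: poll every 4 hours instead of the 30-minute default.
# In /etc/default/smartmontools set:
#   smartd_opts="--interval=14400"
systemctl restart smartd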
After such an event, the zpool shows the following errors but otherwise seems healthy:
Code:
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 1 days 03:15:10 with 0 errors on Mon May 11 03:39:11 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda3    ONLINE       1     0     0
            sdk3    ONLINE       0     0     0
            sdc3    ONLINE       0     0     0
            sdd3    ONLINE       0     0     0
            sde3    ONLINE       1     7     0
            sdf3    ONLINE       0     0     0
            sdg3    ONLINE       0     0     0
            sdh3    ONLINE       0     0     0
Subsequent scrubs complete fine, with no errors (as in the case above).
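For reference, after each such event we clear the counters and re-verify by hand, roughly as follows (standard zpool commands, pool name as above):
Code:
zpool clear rpool         # reset the READ/WRITE/CKSUM error counters
zpool scrub rpool         # re-verify all data on the pool
zpool status -v rpool     # confirm the scrub finishes with 0 errors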
In the logs, every time such an event happens, we see (similar to) the following:
Code:
Apr 20 23:19:28 pve-04 smartd[5514]: Device: /dev/sda [SAT], failed to read SMART Attribute Data
Apr 20 23:19:28 pve-04 smartd[5514]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Apr 20 23:19:28 pve-04 pvestatd[2827236]: got timeout
Apr 20 23:19:29 pve-04 zed: eid=20810 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 smartd[5514]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Apr 20 23:19:29 pve-04 zed: eid=20811 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 zed: eid=20812 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 zed: eid=20813 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:29 pve-04 pvestatd[2827236]: status update time (28.105 seconds)
Apr 20 23:19:30 pve-04 zed: eid=20814 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 pvemailforward[1538369]: forward mail to <admin>
Apr 20 23:19:30 pve-04 zed: eid=20815 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 zed: eid=20816 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 zed: eid=20817 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:30 pve-04 pve-firewall[10752]: firewall update time (26.321 seconds)
Apr 20 23:19:30 pve-04 zed: eid=20818 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:31 pve-04 zed: eid=20819 class=delay pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
Apr 20 23:19:31 pve-04 zed: eid=20820 class=io pool_guid=0xBB4C01A794517AD2 vdev_path=/dev/sda3
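The 'class=delay' events from zed above are slow-I/O reports. If anyone wants to correlate these with load, the reporting threshold can be inspected via the zio_slow_io_ms module parameter (present on our zfs 0.8.3 install); the value below is illustrative only, and raising it merely hides the symptom:
Code:
# Threshold (ms) above which an I/O is reported as a delay event:
cat /sys/module/zfs/parameters/zio_slow_io_ms

# Illustrative only: raise the threshold to 60s while diagnosing.
echo 60000 > /sys/module/zfs/parameters/zio_slow_io_ms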
All manual SMART checks on the affected devices always come back normal.
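Concretely, the manual checks we run look like this (device name taken from the example above):
Code:
smartctl -H /dev/sde      # overall health self-assessment - always PASSED
smartctl -a /dev/sde      # full attribute dump - nothing abnormal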
2. In the second scenario, we have seen the SAME thing occur (i.e. VM boot-sector corruption) when zpools are under significant load.
This typically happens when we migrate VMs from one PVE host to another via an API call, which does NOT seem to respect the bandwidth limits set for the cluster - but that's a topic for another post.
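For what it's worth, we do have cluster-wide limits configured, roughly as below (values illustrative, VMID hypothetical); it is the API-triggered migration that appears to ignore the datacenter.cfg setting:
Code:
# /etc/pve/datacenter.cfg - cluster-wide defaults, values in KiB/s:
bwlimit: migrate=51200

# Per-job override from the CLI (also KiB/s):
qm migrate 100 pve-04 --online --bwlimit 51200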
We believe that this is possibly all related to the following upstream issue:
https://github.com/openzfs/zfs/issues/10095
Any comments/similar observations?
Kind regards,
Angelo.