Proxmox VE very high load average due to process "z_wr_iss"

Oct 25, 2024
Hi,

I'm running Proxmox VE on a Dell Precision 3660 Tower Workstation on ZFS, with 128 GB RAM and a 13th Gen Intel i9-13900K (24 cores, 32 threads).
Every couple of weeks Proxmox VE gets stuck with a very high load average caused by z_wr_iss processes, so that only a hard reset helps.

I've checked the logs and performed several hardware checks (disks, RAM, ...). No hardware issues reported, though.
I've already found similar issues in this forum, but they're related to older kernels.

Some additional information:

uname -a
Linux workstation 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux

df -h
Filesystem Size Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 3.1M 13G 1% /run
rpool/ROOT/pve-1 2.7T 342G 2.4T 13% /
tmpfs 63G 46M 63G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 438K 359K 75K 83% /sys/firmware/efi/efivars
rpool 2.4T 128K 2.4T 1% /rpool
rpool/var-lib-vz 2.4T 128K 2.4T 1% /var/lib/vz
rpool/ROOT 2.4T 128K 2.4T 1% /rpool/ROOT
rpool/data 2.4T 128K 2.4T 1% /rpool/data
/dev/fuse 128M 48K 128M 1% /etc/pve
tmpfs 13G 0 13G 0% /run/user/0

zpool status
pool: rpool
state: ONLINE
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme-eui.0025384b3143ad0d-part3 ONLINE 0 0 0
nvme-eui.0025384b3143ad13-part3 ONLINE 0 0 0

errors: No known data errors


top
top - 15:58:04 up 8:21, 2 users, load average: 34.68, 33.16, 25.85
Tasks: 625 total, 3 running, 622 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 6.9 sy, 0.0 ni, 64.1 id, 29.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 128501.2 total, 113606.5 free, 15648.2 used, 223.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 112853.0 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
117816 root 1 -19 0 0 0 R 5.6 0.0 1:18.74 z_wr_iss
630 root 1 -19 0 0 0 S 5.3 0.0 2:41.88 z_wr_iss
112681 root 1 -19 0 0 0 S 5.3 0.0 1:19.29 z_wr_iss
114908 root 1 -19 0 0 0 S 5.3 0.0 1:18.93 z_wr_iss
115232 root 1 -19 0 0 0 S 5.3 0.0 1:19.10 z_wr_iss
115632 root 1 -19 0 0 0 S 5.3 0.0 1:18.99 z_wr_iss
116337 root 1 -19 0 0 0 S 5.3 0.0 1:18.83 z_wr_iss
117190 root 1 -19 0 0 0 S 5.3 0.0 1:18.87 z_wr_iss
117364 root 1 -19 0 0 0 S 5.3 0.0 1:18.75 z_wr_iss
117486 root 1 -19 0 0 0 S 5.3 0.0 1:18.77 z_wr_iss
117734 root 1 -19 0 0 0 S 5.3 0.0 1:18.92 z_wr_iss
118000 root 1 -19 0 0 0 S 5.3 0.0 1:18.85 z_wr_iss
118176 root 1 -19 0 0 0 S 5.3 0.0 1:18.92 z_wr_iss
118185 root 1 -19 0 0 0 S 5.3 0.0 1:18.89 z_wr_iss
118224 root 1 -19 0 0 0 S 5.3 0.0 1:18.67 z_wr_iss
118250 root 1 -19 0 0 0 S 5.3 0.0 1:19.05 z_wr_iss
118365 root 1 -19 0 0 0 S 5.3 0.0 1:18.87 z_wr_iss
118419 root 1 -19 0 0 0 S 5.3 0.0 1:18.98 z_wr_iss
118551 root 1 -19 0 0 0 S 5.3 0.0 1:18.73 z_wr_iss
118580 root 1 -19 0 0 0 S 5.3 0.0 1:18.99 z_wr_iss
118665 root 1 -19 0 0 0 S 5.3 0.0 1:18.87 z_wr_iss
114642 root 1 -19 0 0 0 S 5.0 0.0 1:18.91 z_wr_iss
117851 root 1 -19 0 0 0 S 5.0 0.0 1:18.82 z_wr_iss
118148 root 1 -19 0 0 0 R 5.0 0.0 1:18.90 z_wr_iss
118211 root 1 -19 0 0 0 S 5.0 0.0 1:18.83 z_wr_iss
As we can see from top, this shouldn't be a memory-related issue, since most of the memory is free; still, the load average of 34.68 is very high, seemingly due to high CPU utilization.
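
A quick way to tell whether such a load comes from CPU demand or from tasks stuck waiting on disk is the kernel's pressure-stall information together with vmstat/iostat. A minimal sketch of what could be captured the next time the load spikes (assuming PSI is enabled in the kernel and the sysstat package is installed):

Bash:
# CPU vs. I/O pressure: a non-zero "full" line for io means tasks are fully stalled on I/O
cat /proc/pressure/cpu
cat /proc/pressure/io

# b = processes blocked on I/O, wa = I/O wait percentage
vmstat 1 10

# per-device utilization and latency (package "sysstat")
iostat -x 1 10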


Thank you in advance.

Best Regards,
Dennis
 
When did you last run a ZFS scrub, and do you run it regularly via cron (e.g. weekly)?
Your rpool is a stripe (not a mirror) - what's the full output of zpool status?
In top you have a high load because of high I/O wait (29.1 wa): the cores want to run but are blocked on I/O.
z_wr_iss are the ZFS write-issue threads.
What about the SMART wear/lifetime indicator values of your NVMe drives (smartctl -x /dev/nvme0n1 and /dev/nvme1n1)?
 
When did you last run a ZFS scrub, and do you run it regularly via cron (e.g. weekly)?
Scrub timer:
root@workstation:~# systemctl list-timers --all | grep scrub
Sun 2024-10-27 03:10:00 CET 1 day 7h left Mon 2024-10-21 08:34:59 CEST 4 days ago e2scrub_all.timer e2scrub_all.service
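
Note that the grep above only matched e2scrub_all, the ext4 scrub timer. On a stock Debian/Proxmox install the ZFS scrub is, to my knowledge, scheduled through a cron snippet shipped with zfsutils-linux rather than a systemd timer, so it is worth checking both places. A small sketch (paths assumed from a default installation):

Bash:
# cron-based scrub shipped by zfsutils-linux (typically runs on the second Sunday of the month)
cat /etc/cron.d/zfsutils-linux

# any ZFS-related systemd timers that may be installed
systemctl list-timers --all | grep -i zfs

# when the pool was last scrubbed
zpool status rpool | grep scan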

Your rpool is a stripe (not a mirror) - what's the full output of zpool status?
zpool status -v (after manual scrub has finished)
zpool status -v
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:07:23 with 0 errors on Fri Oct 25 20:59:35 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme-eui.0025384b3143ad0d-part3 ONLINE 0 0 0
nvme-eui.0025384b3143ad13-part3 ONLINE 0 0 0

errors: No known data errors

Relevant part of the logs:
Oct 25 20:52:12 workstation zed[7836]: eid=6 class=scrub_start pool='rpool'
Oct 25 20:59:35 workstation zed[9877]: eid=9 class=scrub_finish pool='rpool'


What about the SMART wear/lifetime indicator values of your NVMe drives (smartctl -x /dev/nvme0n1 and /dev/nvme1n1)?

smartctl -x /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO with Heatsink 2TB
Serial Number: S7DRNJ0WB42405A
Firmware Version: 0B2QJXG7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,628,955,897,856 [1.62 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4b3143ad13
Local Time is: Fri Oct 25 20:46:54 2024 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.39W - - 0 0 0 0 0 0
1 + 9.39W - - 1 1 1 1 0 0
2 + 9.39W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 4200 2700
4 - 0.0050W - - 4 4 4 4 500 21800

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 6,771,434 [3.46 TB]
Data Units Written: 9,484,663 [4.85 TB]
Host Read Commands: 77,310,111
Host Write Commands: 80,447,126
Controller Busy Time: 294
Power Cycles: 235
Power On Hours: 748
Unsafe Shutdowns: 38
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 27 Celsius
Temperature Sensor 2: 25 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

smartctl -x /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO with Heatsink 2TB
Serial Number: S7DRNJ0WB42399E
Firmware Version: 0B2QJXG7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,789,531,348,992 [1.78 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4b3143ad0d
Local Time is: Fri Oct 25 20:48:09 2024 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.39W - - 0 0 0 0 0 0
1 + 9.39W - - 1 1 1 1 0 0
2 + 9.39W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 4200 2700
4 - 0.0050W - - 4 4 4 4 500 21800

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 28 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 6,572,256 [3.36 TB]
Data Units Written: 17,264,600 [8.83 TB]
Host Read Commands: 114,531,367
Host Write Commands: 537,268,221
Controller Busy Time: 350
Power Cycles: 239
Power On Hours: 807
Unsafe Shutdowns: 42
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 28 Celsius
Temperature Sensor 2: 25 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
 
That looks OK so far. Take a look e.g. with "iotop -o" (apt install iotop) to see which processes generate your I/O.
 
That looks OK so far. Take a look e.g. with "iotop -o" (apt install iotop) to see which processes generate your I/O.

Yes, I'll try that as soon as the issue happens again, assuming that iotop will show "z_wr_iss" processes, which all share a common PPID of 2:

ps -ef | grep z_wr_iss (system is idle now, so no high I/O to be expected at the moment)
root 651 2 0 08:46 ? 00:00:00 [z_wr_iss]
root 652 2 0 08:46 ? 00:00:00 [z_wr_iss_h]
root 21482 2 0 10:09 ? 00:00:00 [z_wr_iss]
root 21658 2 0 10:09 ? 00:00:00 [z_wr_iss]
root 21670 2 0 10:09 ? 00:00:00 [z_wr_iss]
root 21739 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21820 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21821 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21894 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21895 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21903 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21905 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21951 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21952 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21984 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 21985 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 21987 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 21996 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22102 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22148 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22181 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22186 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22255 2 0 10:12 ? 00:00:00 [z_wr_iss]
root 22300 2 0 10:12 ? 00:00:00 [z_wr_iss]
root 22528 2 0 10:13 ? 00:00:00 [z_wr_iss]
root 22559 2 0 10:13 ? 00:00:00 [z_wr_iss]

strings /proc/2/comm
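
For what it's worth, a parent PID of 2 simply means these are kernel threads (PID 2 is kthreadd, the kernel thread parent), so iotop attributes the I/O to them rather than to the userspace process that triggered the writes. A quick check:

Bash:
# PID 2 is the kernel thread parent; expected output: kthreadd
cat /proc/2/comm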

"z_wr_iss" causing high I/O is only the symptom, though. I wonder what's causing this.
I don't think this is as random as it appears.
Could be related to linked clones and/or high-frequency snapshotting.
I'll perform additional tests and get back to you.
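
If frequent snapshots or linked clones are a suspect, one way to gauge the churn is to list snapshots ordered by creation time and to see which volumes are clones. A minimal sketch (dataset names taken from the pool above):

Bash:
# snapshots sorted by creation time, with the space each one pins
zfs list -r -t snapshot -o name,creation,used,refer -s creation rpool/data

# volumes that are clones, together with the snapshot they originate from
zfs get -r -t volume -H -o name,value origin rpool/data | awk '$2 != "-"'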

Thanks, waltar.
 
I thought about closing this forum post this week, but unfortunately the same issue described above happened again today, while only one VM (Windows 11) was running.

top:
top - 11:18:22 up 2:26, 3 users, load average: 29.60, 20.38, 9.59
Tasks: 613 total, 4 running, 609 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.7 us, 8.6 sy, 0.0 ni, 53.8 id, 20.3 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 128501.2 total, 96276.9 free, 27163.8 used, 6189.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 101337.4 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4905 root 20 0 17.9g 16.3g 13568 S 300.7 13.0 42:55.69 kvm
651 root 1 -19 0 0 0 S 8.7 0.0 0:30.96 z_wr_iss
30236 root 1 -19 0 0 0 S 8.7 0.0 0:30.50 z_wr_iss
33422 root 1 -19 0 0 0 S 8.7 0.0 0:30.55 z_wr_iss
34956 root 1 -19 0 0 0 S 8.7 0.0 0:30.51 z_wr_iss
36053 root 1 -19 0 0 0 S 8.7 0.0 0:30.46 z_wr_iss
36769 root 1 -19 0 0 0 S 8.7 0.0 0:30.46 z_wr_iss
36826 root 1 -19 0 0 0 S 8.7 0.0 0:30.53 z_wr_iss
36940 root 1 -19 0 0 0 S 8.7 0.0 0:30.52 z_wr_iss
33921 root 1 -19 0 0 0 S 8.3 0.0 0:30.51 z_wr_iss
34328 root 1 -19 0 0 0 R 8.3 0.0 0:30.49 z_wr_iss
34525 root 1 -19 0 0 0 S 8.3 0.0 0:30.53 z_wr_iss

Managed to capture a screenshot of "iotop -oa", which is attached to this message.
The system log displayed the error "ZFS has encountered an uncorrectable I/O failure and has been suspended", but after a hard shutdown, neither smartmontools nor "zpool status" reported any errors. Scrubbing ZFS again now.
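
After a pool suspension like that, the ZFS event log and the kernel ring buffer are usually the most telling places to look before rebooting. A sketch with standard commands (nothing here is specific to this setup):

Bash:
# only pools with problems
zpool status -x

# ZFS internal event history (I/O errors, delays, suspend events)
zpool events -v | less

# kernel messages around the failure
dmesg -T | grep -iE 'nvme|zio|zfs'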
 

Attachments

  • proxmox-bug.png (screenshot of "iotop -oa", 381.4 KB)
As suggested in the last line of your iotop screenshot, run "echo 1 > /proc/sys/kernel/task_delayacct" to enable per-process I/O accounting before running iotop, so you can see how much (in %) each process is held back by I/O wait! top shows 20.3 wa, which is really bad and is the reason for your high load numbers.
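
To make that setting survive a reboot, the same switch can also be set via sysctl. A minimal sketch (the sysctl key is kernel.task_delayacct on recent kernels; the file name under /etc/sysctl.d is an arbitrary choice):

Bash:
# enable per-task delay accounting for the running kernel
echo 1 > /proc/sys/kernel/task_delayacct

# make it persistent across reboots
echo 'kernel.task_delayacct = 1' > /etc/sysctl.d/90-task-delayacct.conf
sysctl --system

# iotop can now show the IO> column (share of time a process spends waiting on I/O)
iotop -o
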
ZFS, as a copy-on-write filesystem, effectively turns sequential I/O into random I/O, especially for VM disk images, because any changed block is written to a new location; regular filesystem readahead is therefore inefficient (which is why ZFS reads poorly once the working set exceeds the ARC by more than about 2x). Still, what you had was a write problem on a striped (NOT mirrored) NVMe pool. High fragmentation leads to more random I/O, which leads to high I/O wait and finally a high server load. Then again, your pool usage is quite small, so fragmentation shouldn't be that bad yet.
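
Free-space fragmentation can be checked directly. Note that FRAG in zpool list describes the fragmentation of the free space, not of the stored data, but it is still a useful indicator here (a sketch):

Bash:
# CAP = pool usage, FRAG = free-space fragmentation
zpool list -o name,size,alloc,free,capacity,fragmentation rpool
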
Scrubbing ZFS weekly via cron is always a good idea.
What's the output of "zfs list"?
 
As suggested in the last line of your iotop screenshot, run "echo 1 > /proc/sys/kernel/task_delayacct" to enable per-process I/O accounting before running iotop, so you can see how much (in %) each process is held back by I/O wait! top shows 20.3 wa, which is really bad and is the reason for your high load numbers.
ZFS, as a copy-on-write filesystem, effectively turns sequential I/O into random I/O, especially for VM disk images, because any changed block is written to a new location; regular filesystem readahead is therefore inefficient (which is why ZFS reads poorly once the working set exceeds the ARC by more than about 2x). Still, what you had was a write problem on a striped (NOT mirrored) NVMe pool. High fragmentation leads to more random I/O, which leads to high I/O wait and finally a high server load. Then again, your pool usage is quite small, so fragmentation shouldn't be that bad yet.
Scrubbing ZFS weekly via cron is always a good idea.
What's the output of "zfs list"?

zfs list
root@workstation:~# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 1.30T 2.21T 104K /rpool
rpool/ROOT 386G 2.21T 96K /rpool/ROOT
rpool/ROOT/pve-1 386G 2.21T 386G /
rpool/data 946G 2.21T 96K /rpool/data
rpool/data/base-104-disk-0 7.27G 2.21T 7.27G -
rpool/data/base-106-disk-0 1.47G 2.21T 1.47G -
rpool/data/base-107-disk-0 1.70G 2.21T 1.70G -
rpool/data/base-108-disk-0 874M 2.21T 874M -
rpool/data/base-109-disk-0 1.36G 2.21T 1.36G -
rpool/data/base-110-disk-0 1.17G 2.21T 1.17G -
rpool/data/base-111-disk-0 23.3G 2.21T 23.3G -
rpool/data/base-112-disk-0 19.7G 2.21T 19.7G -
rpool/data/base-112-disk-1 153G 2.21T 153G -
rpool/data/base-120-disk-0 1.95G 2.21T 1.95G -
rpool/data/base-124-disk-0 54.1G 2.21T 54.1G -
rpool/data/vm-100-disk-0 66.3G 2.21T 51.1G -
rpool/data/vm-101-disk-0 95.0G 2.21T 79.6G -
rpool/data/vm-102-disk-0 30.3G 2.21T 23.8G -
rpool/data/vm-103-disk-0 85.1G 2.21T 80.3G -
rpool/data/vm-105-disk-0 21.0G 2.21T 21.0G -
rpool/data/vm-113-disk-0 81.8G 2.21T 80.3G -
rpool/data/vm-114-disk-0 711M 2.21T 5.31G -
rpool/data/vm-114-disk-1 9.35G 2.21T 153G -
rpool/data/vm-115-disk-0 400K 2.21T 132K -
rpool/data/vm-115-disk-1 110G 2.21T 62.1G -
rpool/data/vm-115-disk-2 256K 2.21T 64K -
rpool/data/vm-117-disk-0 78.2G 2.21T 78.2G -
rpool/data/vm-118-disk-0 20.3G 2.21T 20.3G -
rpool/data/vm-119-disk-0 80.3G 2.21T 80.3G -
rpool/data/vm-121-disk-0 2.50G 2.21T 2.50G -
rpool/var-lib-vz 96K 2.21T 96K /var/lib/vz

zpool status -v
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:07:39 with 0 errors on Tue Nov 12 11:58:59 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme-eui.0025384b3143ad0d-part3 ONLINE 0 0 0
nvme-eui.0025384b3143ad13-part3 ONLINE 0 0 0

errors: No known data errors
 
What ashift value does your pool have? Do you trim your NVMe pool regularly?
It looks like your 990 Pros don't like your ZFS write load that much; even though they are high-end consumer NVMe drives, they are not enterprise hardware, and you cannot count on the headline specs holding up under real-world usage.
If you had an HDD pool, you could create a virtual zpool backed by files generated with truncate (inside a mounted dataset on the real pool), then live-migrate the VMs one by one from the real pool to the virtual one and back again (to rewrite and defragment them) ... but I'm not sure that would help on your NVMe pool.
 
What ashift value does your pool have? Do you trim your NVMe pool regularly?
It looks like your 990 Pros don't like your ZFS write load that much; even though they are high-end consumer NVMe drives, they are not enterprise hardware, and you cannot count on the headline specs holding up under real-world usage.
If you had an HDD pool, you could create a virtual zpool backed by files generated with truncate (inside a mounted dataset on the real pool), then live-migrate the VMs one by one from the real pool to the virtual one and back again (to rewrite and defragment them) ... but I'm not sure that would help on your NVMe pool.

You're right, I should run trim more frequently, so I'll schedule it as a cronjob. Thanks for the advice.

When Proxmox froze yesterday, only my Windows 11 VM was running, with MS Word and the Task Manager open, nothing else. No Windows updates, malware scans or defragmentation were running, so nothing that would justify a high workload in the Windows VM. According to the Windows Task Manager, the Windows machine was 95% idle, but according to Proxmox, the Windows VM was generating irrecoverably high I/O, which caused Proxmox to get stuck. Unfortunately I can't pin the issue on the Windows VM, since the same thing happened in the past with other, Linux-based VMs. The only thing my Windows and Linux VMs have in common is that their virtual disks are encrypted (VeraCrypt on Windows, LUKS on Linux), but the I/O overhead due to disk encryption should not be that high, since encryption/decryption is handled by the CPU.
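
On the encryption point: whether AES is actually accelerated by the CPU's AES-NI instructions can be verified quickly inside the Linux guests, which would at least rule the cipher itself out as the bottleneck. A sketch (run inside a VM, not on the host; the numbers are only indicative since the benchmark runs in memory without disk I/O):

Bash:
# confirm the vCPU exposes AES-NI to the guest
grep -m1 -o aes /proc/cpuinfo

# measure LUKS cipher throughput in memory
cryptsetup benchmark --cipher aes-xts-plain64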

I didn't consider the 990 Pro to be a premium disk; they're simply my default disks in all of my machines and notebooks, and I've never had any issues with them until now. I don't even know whether this issue is related to hardware at all. If it is, I'd be glad to just replace the disks, which would be an easy fix.
If this is ZFS-related, do you think I'd be better off using EXT4 instead of ZFS?

You mentioned my disk setup is striped across the two NVMEs, not mirrored. Yes, that's on purpose, since I need the FS space. I do regular backups of important VMs and store additional copies on remote machines, instead of mirroring on the same machine.

ashift value
zdb -C rpool | grep ashift
ashift: 12
ashift: 12
 
Wrote and scheduled a script that starts a scrub, waits for it to finish, and then trims the disks.

Bash:
#!/bin/bash
readonly pool='rpool'

# Returns success while a scrub is in progress on the given pool.
scrub_running() {
  zpool status "${1}" | grep -q 'scan: scrub in progress'
}

echo "Running scrub on ${pool} ..."
scrub_running "${pool}" || zpool scrub "${pool}"

# Poll until the scrub has finished.
sleep 3
while scrub_running "${pool}"; do
  echo "Scrub is still running ..."
  sleep 10
done

echo "Running trim"
fstrim -av

exit $?
 
For ZFS I would use its own trim, which means running "zpool trim <poolname>", or use the ZFS systemd units if installed, e.g.:
"systemctl enable zfs-trim-weekly@<poolname>.timer --now"
If this is ZFS-related, do you think I'd be better off using EXT4 instead of ZFS?
If you run ZFS as a stripe, the checksums are of limited use, since there is no redundancy to repair bad blocks from. It depends on how you create your backups: if you use e.g. zfs send|receive, stay with ZFS; otherwise you could do just as well with plain XFS, or with an LVM/LVM-thin plus XFS combination.
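
For reference, a minimal send/receive-style backup of one VM dataset to a remote host could look like this (a sketch only; the host name and target pool are placeholders, and incremental sends require the previous snapshot to still exist on both sides):

Bash:
SNAP="backup-$(date +%F)"

# snapshot the VM disk (or use -r on rpool/data to cover everything)
zfs snapshot rpool/data/vm-100-disk-0@"$SNAP"

# full send to a remote pool
zfs send rpool/data/vm-100-disk-0@"$SNAP" | \
  ssh backuphost zfs receive -F backuppool/vm-100-disk-0

# incremental send against the previous snapshot (kept on both sides)
# zfs send -i @backup-2024-11-01 rpool/data/vm-100-disk-0@"$SNAP" | \
#   ssh backuphost zfs receive backuppool/vm-100-disk-0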
 
A "stripe" is theoretically just half as reliable as each broken disk ((all) of the 2 here) would be a crash to pve os and all the vm/lxc to.
As you have a z_wr_iss issue writes go half to one disk but in mirror that doubles the I/O write load and would not it make better yet.
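
Put as numbers (illustrative failure rate, not a spec value): with a per-disk annual failure probability p, a two-disk stripe fails if either disk fails, so P = 1 - (1 - p)^2 ≈ 2p for small p; e.g. p = 3% gives 1 - 0.97^2 ≈ 5.9%, roughly double the single-disk risk.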
 
It's funny and sad at the same time. I bought a new SSD (same model as before), reinstalled PVE from scratch and replaced ZFS with XFS. Yesterday night, after three weeks without any issues, I couldn't log in via SSH anymore. I connected a keyboard and screen to the machine; this is what was displayed on the login screen:

Code:
workstation login: XFS (dm-1): log I/O error -5
XFS (dm-1): Filesystem has been shut down due to log error (0x2).
XFS (dm-1): Please unmount the filesystem and rectify the problem
XFS (dm-1): log I/O error -5
XFS (dm-1): log I/O error -5
XFS (dm-1): log I/O error -5
XFS (dm-1): log I/O error -5

The machine is not reacting to keyboard input, though. After rebooting and analyzing the logs using journalctl, I see ... well ... nothing at all, since it appears the filesystem was shut down before any logs could be written. Even though I'm repeating myself, I want to point out that this is a new SSD. It's statistically unlikely that two different SSDs have the same issue, right?

So I keep asking myself:
Since this can no longer be a ZFS issue, nor a hardware issue related to the SSD I replaced, what could be causing it?
 
That still looks like a hardware error, either from the new disk or from other components of your host. I don't think it's software-related to the fresh PVE installation.
 
Buying two broken high-end SSDs in a row is highly unlikely, plus I've been using the previous SSD in a different machine for a month without any issues.
smartctl never reported any errors for either of the SSDs, and neither did the intensive UEFI diagnostics.
The SSDs are both fine. This is not related to hardware.

It seems to be a firmware compatibility issue between Proxmox/Debian and the Samsung SSD 990 PRO, as reported by multiple Proxmox users in this thread.
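
If it really is a 990 PRO firmware issue, checking for and applying a newer firmware would be the obvious next step. One possible sketch using fwupd (assuming the drive is published on LVFS; otherwise Samsung's own update ISO or Magician tool would be needed):

Bash:
# current firmware revision (0B2QJXG7 in the output above)
smartctl -i /dev/nvme0n1 | grep -i firmware

# check LVFS for an update and apply it (package "fwupd")
fwupdmgr refresh
fwupdmgr get-updates
fwupdmgr update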
 
Sadly, it looks like the 990 is not the kind of product customers have come to expect from Samsung. I would even say M.2 NVMe drives aren't high-end (for enterprise use) anyway, but from a consumer brand like Samsung one would expect much better than from an unknown Chinese brand.
 