Proxmox VE very high load average due to process "z_wr_iss"

Oct 25, 2024
Hi,

I'm running Proxmox VE on a Dell Precision 3660 Tower Workstation on ZFS, with 128 GB RAM and a 13th Gen Intel i9-13900K (24 cores, 32 threads).
Every couple of weeks Proxmox VE gets stuck with a very high load average caused by z_wr_iss processes, to the point that only a hard reset helps.

I've checked the logs and run several hardware checks (disks, RAM, ...); no hardware issues were reported.
I've already found similar threads in this forum, but they relate to older kernels.

Some additional information:

uname -a
Linux workstation 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux

df -h
Filesystem Size Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 3.1M 13G 1% /run
rpool/ROOT/pve-1 2.7T 342G 2.4T 13% /
tmpfs 63G 46M 63G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 438K 359K 75K 83% /sys/firmware/efi/efivars
rpool 2.4T 128K 2.4T 1% /rpool
rpool/var-lib-vz 2.4T 128K 2.4T 1% /var/lib/vz
rpool/ROOT 2.4T 128K 2.4T 1% /rpool/ROOT
rpool/data 2.4T 128K 2.4T 1% /rpool/data
/dev/fuse 128M 48K 128M 1% /etc/pve
tmpfs 13G 0 13G 0% /run/user/0

zpool status
pool: rpool
state: ONLINE
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme-eui.0025384b3143ad0d-part3 ONLINE 0 0 0
nvme-eui.0025384b3143ad13-part3 ONLINE 0 0 0

errors: No known data errors


top
top - 15:58:04 up 8:21, 2 users, load average: 34.68, 33.16, 25.85
Tasks: 625 total, 3 running, 622 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 6.9 sy, 0.0 ni, 64.1 id, 29.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 128501.2 total, 113606.5 free, 15648.2 used, 223.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 112853.0 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
117816 root 1 -19 0 0 0 R 5.6 0.0 1:18.74 z_wr_iss
630 root 1 -19 0 0 0 S 5.3 0.0 2:41.88 z_wr_iss
112681 root 1 -19 0 0 0 S 5.3 0.0 1:19.29 z_wr_iss
114908 root 1 -19 0 0 0 S 5.3 0.0 1:18.93 z_wr_iss
115232 root 1 -19 0 0 0 S 5.3 0.0 1:19.10 z_wr_iss
115632 root 1 -19 0 0 0 S 5.3 0.0 1:18.99 z_wr_iss
116337 root 1 -19 0 0 0 S 5.3 0.0 1:18.83 z_wr_iss
117190 root 1 -19 0 0 0 S 5.3 0.0 1:18.87 z_wr_iss
117364 root 1 -19 0 0 0 S 5.3 0.0 1:18.75 z_wr_iss
117486 root 1 -19 0 0 0 S 5.3 0.0 1:18.77 z_wr_iss
117734 root 1 -19 0 0 0 S 5.3 0.0 1:18.92 z_wr_iss
118000 root 1 -19 0 0 0 S 5.3 0.0 1:18.85 z_wr_iss
118176 root 1 -19 0 0 0 S 5.3 0.0 1:18.92 z_wr_iss
118185 root 1 -19 0 0 0 S 5.3 0.0 1:18.89 z_wr_iss
118224 root 1 -19 0 0 0 S 5.3 0.0 1:18.67 z_wr_iss
118250 root 1 -19 0 0 0 S 5.3 0.0 1:19.05 z_wr_iss
118365 root 1 -19 0 0 0 S 5.3 0.0 1:18.87 z_wr_iss
118419 root 1 -19 0 0 0 S 5.3 0.0 1:18.98 z_wr_iss
118551 root 1 -19 0 0 0 S 5.3 0.0 1:18.73 z_wr_iss
118580 root 1 -19 0 0 0 S 5.3 0.0 1:18.99 z_wr_iss
118665 root 1 -19 0 0 0 S 5.3 0.0 1:18.87 z_wr_iss
114642 root 1 -19 0 0 0 S 5.0 0.0 1:18.91 z_wr_iss
117851 root 1 -19 0 0 0 S 5.0 0.0 1:18.82 z_wr_iss
118148 root 1 -19 0 0 0 R 5.0 0.0 1:18.90 z_wr_iss
118211 root 1 -19 0 0 0 S 5.0 0.0 1:18.83 z_wr_iss
As we can see from top, this shouldn't be a memory-related issue, since most of the memory is free. The load average of 34.68 is nevertheless very high, even though actual CPU utilization is low.
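
To double-check next time that the load really comes from I/O wait rather than the CPUs, I plan to capture something along these lines (standard tools, nothing Proxmox-specific):

vmstat 1 5                              # a high 'wa' column with near-zero 'us'/'sy' means the cores are idle, waiting on I/O
ps -eo state,pid,comm | awk '$1=="D"'   # tasks in uninterruptible sleep (state D) are what push the load average up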


Thank you in advance.

Best Regards,
Dennis
 
When did you last run a ZFS scrub, and do you run it via cron (e.g. weekly)?
Your rpool is a stripe (not a mirror) - what's the full output of zpool status?
In top you have high load because of high I/O wait (29.1 wa): the cores want to work but are blocked on I/O.
z_wr_iss are the ZFS write-issue threads.
What about the SMART wear/lifetime indicators of your NVMe drives (smartctl -x /dev/nvme0n1 and /dev/nvme1n1)?
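I.e. roughly (adjust the device names to your system):

zpool status -v rpool        # the 'scan:' line shows when the last scrub ran and its result
smartctl -x /dev/nvme0n1     # check 'Percentage Used' and 'Available Spare'
smartctl -x /dev/nvme1n1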
 
When did you last run a ZFS scrub, and do you run it via cron (e.g. weekly)?
Scrub timer:
root@workstation:~# systemctl list-timers --all | grep scrub
Sun 2024-10-27 03:10:00 CET 1 day 7h left Mon 2024-10-21 08:34:59 CEST 4 days ago e2scrub_all.timer e2scrub_all.service
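
Note that e2scrub_all is the ext4 scrubber, so the grep above doesn't really answer the question. If I'm not mistaken, on Proxmox/Debian the ZFS scrub is scheduled via cron rather than a systemd timer, and a scrub can be started by hand:

cat /etc/cron.d/zfsutils-linux   # default schedule shipped by zfsutils-linux (second Sunday of the month)
zpool scrub rpool                # start a manual scrub
zpool status -v rpool            # progress shows up in the 'scan:' line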

Your rpool is a stripe (not a mirror) - what's the full output of zpool status?
Output of zpool status -v after the manual scrub finished:
zpool status -v
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:07:23 with 0 errors on Fri Oct 25 20:59:35 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme-eui.0025384b3143ad0d-part3 ONLINE 0 0 0
nvme-eui.0025384b3143ad13-part3 ONLINE 0 0 0

errors: No known data errors

Relevant part of the logs:
Oct 25 20:52:12 workstation zed[7836]: eid=6 class=scrub_start pool='rpool'
Oct 25 20:59:35 workstation zed[9877]: eid=9 class=scrub_finish pool='rpool'


What about the SMART wear/lifetime indicators of your NVMe drives (smartctl -x /dev/nvme0n1 and /dev/nvme1n1)?

smartctl -x /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO with Heatsink 2TB
Serial Number: S7DRNJ0WB42405A
Firmware Version: 0B2QJXG7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,628,955,897,856 [1.62 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4b3143ad13
Local Time is: Fri Oct 25 20:46:54 2024 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.39W - - 0 0 0 0 0 0
1 + 9.39W - - 1 1 1 1 0 0
2 + 9.39W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 4200 2700
4 - 0.0050W - - 4 4 4 4 500 21800

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 6,771,434 [3.46 TB]
Data Units Written: 9,484,663 [4.85 TB]
Host Read Commands: 77,310,111
Host Write Commands: 80,447,126
Controller Busy Time: 294
Power Cycles: 235
Power On Hours: 748
Unsafe Shutdowns: 38
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 27 Celsius
Temperature Sensor 2: 25 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

smartctl -x /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO with Heatsink 2TB
Serial Number: S7DRNJ0WB42399E
Firmware Version: 0B2QJXG7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,789,531,348,992 [1.78 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4b3143ad0d
Local Time is: Fri Oct 25 20:48:09 2024 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.39W - - 0 0 0 0 0 0
1 + 9.39W - - 1 1 1 1 0 0
2 + 9.39W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 4200 2700
4 - 0.0050W - - 4 4 4 4 500 21800

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 28 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 6,572,256 [3.36 TB]
Data Units Written: 17,264,600 [8.83 TB]
Host Read Commands: 114,531,367
Host Write Commands: 537,268,221
Controller Busy Time: 350
Power Cycles: 239
Power On Hours: 807
Unsafe Shutdowns: 42
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 28 Celsius
Temperature Sensor 2: 25 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
 
That looks OK so far. Take a look e.g. with "iotop -o" (apt install iotop) to see which processes generate your I/O.
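If the box tends to lock up before you can get a shell, you could also log it in batch mode, roughly like this:

iotop -obt -d 5 >> /var/log/iotop.log   # append the current I/O producers every 5 seconds, with timestamps
zpool iostat -v rpool 5                 # per-vdev view of what ZFS itself is writing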
 
That looks OK so far. Take a look e.g. with "iotop -o" (apt install iotop) to see which processes generate your I/O.

Yes, I'll try that as soon as the issue happens again. I expect iotop will show the "z_wr_iss" processes, which all share PPID 2:

ps -ef | grep z_wr_iss (the system is idle right now, so no high I/O is expected)
root 651 2 0 08:46 ? 00:00:00 [z_wr_iss]
root 652 2 0 08:46 ? 00:00:00 [z_wr_iss_h]
root 21482 2 0 10:09 ? 00:00:00 [z_wr_iss]
root 21658 2 0 10:09 ? 00:00:00 [z_wr_iss]
root 21670 2 0 10:09 ? 00:00:00 [z_wr_iss]
root 21739 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21820 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21821 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21894 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21895 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21903 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21905 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21951 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21952 2 0 10:10 ? 00:00:00 [z_wr_iss]
root 21984 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 21985 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 21987 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 21996 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22102 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22148 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22181 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22186 2 0 10:11 ? 00:00:00 [z_wr_iss]
root 22255 2 0 10:12 ? 00:00:00 [z_wr_iss]
root 22300 2 0 10:12 ? 00:00:00 [z_wr_iss]
root 22528 2 0 10:13 ? 00:00:00 [z_wr_iss]
root 22559 2 0 10:13 ? 00:00:00 [z_wr_iss]

strings /proc/2/comm shows that PID 2 is kthreadd, so these are kernel threads.

"z_wr_iss" causing high I/O is only the symptom, though; I wonder what is actually generating the writes.
I don't think this is as random as it appears.
It could be related to linked clones and/or high-frequency snapshotting.
I'll perform additional tests and get back to you.
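
To check the snapshot/clone theory I'll probably start with something like this (plain zfs commands):

zfs list -t snapshot -o name,creation,used -s creation | tail -n 30   # how many snapshots exist and how quickly they are created
zfs get -r -H -o name,value origin rpool | awk '$2 != "-"'            # datasets that are linked clones (origin property set)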

Thanks, waltar.
 
