This issue has been solved. I'm posting this thread for anybody with slow I/O performance who is searching for these keywords. The cause might be ZFS trimming the rpool. I'm running Proxmox VE 7.3-1 on a Supermicro A2SDi-8C+-HLN4F with an HP SSD EX920 1TB for the rpool.
My Proxmox node was unresponsive. VMs were showing 100% CPU in tgtd (the SCSI target daemon). Containers were completely dead. SSH logins to the hypervisor took about a minute and running commands took many seconds. I rebooted the server and it took over 10 minutes to boot instead of the normal 30 seconds, with some of the startup services timing out, and there was no improvement afterwards. I then power cycled the server with the same result.
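(Side note for anyone triaging similar symptoms: this isn't something I ran during the incident, but watching per-vdev activity would have pointed at the disk much sooner. Something like:)
Code:
# zpool iostat -v rpool 5
That prints pool and vdev read/write ops and bandwidth every 5 seconds; near-zero throughput while everything is hanging is a strong hint the pool itself is stuck.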
Fearing a hardware failure, I checked the drive with smartctl first.
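I didn't save the exact command, but it was essentially the standard health dump; /dev/nvme0 is just how the drive enumerates on my box:
Code:
# smartctl -a /dev/nvme0
The NVMe came back healthy, so I then checked the ZFS rpool, and that gave the first hint of the problem.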
Code:
# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:10:43 with 0 errors on Sun Jan 8 00:34:44 2023
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p2  ONLINE       0     0     0  (trimming)
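Something I only learned afterwards: zpool status -t shows the TRIM state and progress per vdev explicitly, which would have made the situation obvious at a glance:
Code:
# zpool status -t rpool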
That trimming keyword was unexpected. I cancelled the trim operation using:
Code:
# zpool trim -c rpool
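While I was at it, I also made sure continuous trimming wasn't the culprit. autotrim is a pool property that defaults to off, but it's cheap to verify (the set is only needed if it comes back enabled):
Code:
# zpool get autotrim rpool
# zpool set autotrim=off rpool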
After another reboot, everything was back to normal. I'm not sure how common this problem is, as this is the first time I've seen it. The pool has been trimmed periodically for years without any problems, and the trim usually finishes in a few minutes. In this case ZFS somehow got stuck trimming and tanked I/O performance, even across reboots and power cycles. Hopefully this helps anybody else with the same problem.
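One sanity check I'd suggest once things are stable again (my suggestion, not something from the original incident): kick off a manual trim during a quiet window and watch it run to completion:
Code:
# zpool trim rpool
# zpool status -t rpool
On Proxmox 7 the periodic trim comes from Debian's zfsutils-linux package (a monthly cron job), if I remember correctly, so a stuck trim could come back on its own schedule if the underlying cause isn't resolved.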