Recent terrible instability - NVMe and PVE 9 with ZFS?

fribse

Member
Feb 2, 2022
Hi All
I've had some wild hangs recently on my small homelab.
It's an SFF box with an Intel Core 7 240H. In it I have two M.2 NVMe drives: a smaller one for the OS and a bigger one for the VMs/CTs.
After searching like crazy and not finding a solution, I turned to ChatGPT, which informed me that ZFS auto-trim sometimes causes system freezes and hangs on NVMe drives.
So I should disable auto-trim for ZFS.
Is this a red herring, or is it true? I disabled auto-trim two days ago and haven't seen the hangs since, so empirically it seems right; I just find it hard to believe.
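For reference, this is how checking and disabling autotrim looks; "rpool" is the Proxmox installer's default pool name, so substitute your own (list pools with zpool list):

Code:
# Check whether autotrim is currently enabled on the pool
zpool get autotrim rpool

# Disable it (pool properties take effect immediately, no reboot needed)
zpool set autotrim=off rpool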

I have had the host running for 4 months now in its current config, and it is only during the last couple of weeks that I've seen the hangs.
 
I am not sure if this is still the case, but I have read in the past that you don't need to enable autotrim, as Proxmox has a script that runs periodically to do the same thing.
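For what it's worth, on Debian-based installs (Proxmox included) the periodic trim ships with the zfsutils-linux package; the exact path may differ between versions, so treat this as a sketch:

Code:
# Monthly TRIM/scrub jobs installed by zfsutils-linux (path may vary by version)
cat /etc/cron.d/zfsutils-linux

# A one-off manual TRIM can be kicked off at any time
zpool trim rpool

# Check TRIM progress/status
zpool status -t rpool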
 
Sorry, I'll specify more.
Server is Asus NUC 15 Pro
RAM: Kingston Fury Impact 64 GB kit
OS: Transcend MTE410S 256GB M.2 SSD
ZFS: Crucial T705 1TB M.2 SSD

It's running two VMs and 8 CTs.
What I see is a GUI that responds very slowly and doesn't allow login. The console still shows a login prompt, but the password is not accepted.
The VMs and CTs don't respond.
I can short-press the power button, which makes it shut down; on reboot it shows various errors in the log.
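To capture those errors, the previous boot's log can be pulled from the journal after the forced reboot (this needs a persistent journal, which Proxmox has by default):

Code:
# Error-level messages from the previous boot
journalctl -b -1 -p err

# Kernel messages from the previous boot, filtered for the storage stack
journalctl -b -1 -k | grep -iE 'nvme|zfs|hung'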
 
Try booting into a live ISO of Debian, Ubuntu, or some rescue Linux. Does it function OK?
Mount the disk and examine the logs.
Run some memory and disk tests (a few example commands below).
Reset/upgrade/downgrade your BIOS. There have been reports of recent firmware upgrades making a system unusable, forcing the user to downgrade.
You can also try to downgrade or upgrade the kernel (can be done from a rescue disk).
Frankly, the possibilities are many... Could be a bitcoin miner running there that you don't know about...
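A few example commands for the tests suggested above. The device names are assumptions (list yours with nvme list or lsblk), and memtester may need installing first with apt install memtester:

Code:
# SMART health and error counters for the NVMe drives
smartctl -a /dev/nvme0n1
smartctl -a /dev/nvme1n1

# Controller error log (nvme-cli package)
nvme error-log /dev/nvme0n1

# Quick in-OS RAM test; a full memtest86+ run from the boot menu is more thorough
memtester 2048M 1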


 
Thanks, well unless it somehow got hacked (there is no public access to it), there is no BTC miner on it.
The kernel angle could be interesting; a Proxmox kernel update was applied not too long ago.
As for the hardware itself, it would have to be degrading very quickly, since it is quite new, but I will run some tests; the PVE boot menu even shows access to some tests...
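For reference, rolling back to the previous kernel can be done with proxmox-boot-tool on recent PVE versions; the version string below is just a placeholder, use one from the list:

Code:
# List the kernels currently available to boot
proxmox-boot-tool kernel list

# Pin a previously working version (placeholder version string)
proxmox-boot-tool kernel pin 6.8.12-4-pve

# Unpin again once done testing
proxmox-boot-tool kernel unpin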
 
Personally, if you have backups of your VMs and containers, I would wipe Proxmox and start again. Before re-installing, you can use something like Dr.Parted or Clonezilla Live to boot from USB and check/repair your drives as needed. On install, I would put both drives together as a ZFS mirror (assuming they are the same size) and keep the OS and the VM storage on the same set of disks. I do this on my HP Elite Mini 800, and it works great; I am only using two 1TB NVMe drives for it. I don't store any data on my Proxmox node, so it doesn't take much disk space.
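After a reinstall onto both disks, the mirror layout can be verified like this ("rpool" is the installer's default pool name):

Code:
# Confirm both disks sit under a single mirror vdev and are ONLINE
zpool status rpool

# Healthy output looks roughly like:
#   config:
#     rpool         ONLINE
#       mirror-0    ONLINE
#         nvme-...  ONLINE
#         nvme-...  ONLINE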