PVE 9.0.3 randomly unresponsive

toasun

New Member
Jan 12, 2026
Hi, heroes!
I've run into a problem, and it has been happening more frequently lately.

**Hardware:**
PVE version 9.0.3
2× Intel Xeon E5-2640 v4, 128 GB RAM, and 2 SSDs

**Symptoms**
The server becomes unresponsive at random intervals, every 1–7 days.
When it happens, ping still works, but SSH, keyboard input, and web UI access are all blocked.
There are NO error logs.
BUT: all virtual machines keep running completely normally.

**Attempts**
I've updated intel-microcode.
Disabled C3/C6 states in the motherboard BIOS.
Added intel_idle.max_cstate=1 to the kernel command line in GRUB (see the snippet below).
Updated the kernel.
Yet there's no improvement.
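
For reference, this is how I applied the C-state setting (assuming a GRUB-booted system with the default /etc/default/grub layout; the "quiet" option is just the Debian default and the existing options may differ on your system):

Code:
# /etc/default/grub: intel_idle.max_cstate=1 appended to the existing options
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1"

# then regenerate the boot config and reboot
update-grub
reboot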

Please help me, heroes!!

In the log file, after Jan 17 16:22:11 the server froze and nothing more was logged.
PVE server IP: 192.168.0.100
IP of one of the PVE VMs: 192.168.0.98

Attachments: web.jpg, ping.jpg, ssh.jpg, ssh-vm.jpg

I quickly scanned your log and found the following:

Code:
Jan 17 03:03:51 hualong pmxcfs[1378]: [status] notice: RRD update error /var/lib/rrdcached/db/pve-storage-9.0/hualong/local: /var/lib/rrdcached/db/pve-storage-9.0/hualong/local: illegal attempt to update using time 1768590230 when last update time is 1768590230 (minimum one second step) *5
This may indicate some type of time inconsistency/drift.
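
To rule out clock drift, I'd check time sync on the host, e.g. (commands assume chrony, the PVE default):

Code:
timedatectl                 # shows whether the system clock is synchronized
chronyc tracking            # current offset from the configured NTP sources

Then there are these NVMe timeouts: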

Code:
Jan 17 06:53:41 hualong kernel: nvme nvme0: I/O tag 560 (c230) opcode 0x1 (I/O Cmd) QID 8 timeout, aborting req_op:WRITE(1) size:131072
Jan 17 06:54:00 hualong kernel: nvme nvme0: I/O tag 561 (e231) opcode 0x1 (I/O Cmd) QID 8 timeout, aborting req_op:WRITE(1) size:77824
Jan 17 06:54:00 hualong kernel: nvme nvme0: I/O tag 558 (722e) opcode 0x1 (I/O Cmd) QID 8 timeout, aborting req_op:WRITE(1) size:8192
Jan 17 06:54:00 hualong kernel: nvme nvme0: Abort status: 0x0
Jan 17 06:54:00 hualong kernel: nvme nvme0: I/O tag 534 (6216) opcode 0x1 (I/O Cmd) QID 8 timeout, aborting req_op:WRITE(1) size:4096
Jan 17 06:53:48 hualong pve-firewall[1554]: firewall update time (19.723 seconds)
Jan 17 06:53:49 hualong pvestatd[1562]: status update time (28.169 seconds)
Jan 17 06:53:53 hualong pve-ha-crm[1602]: loop take too long (32 seconds)
Jan 17 06:53:53 hualong pve-ha-lrm[1617]: loop take too long (34 seconds)
Jan 17 06:54:02 hualong kernel: nvme nvme0: Abort status: 0x0
Jan 17 06:54:07 hualong kernel: nvme nvme0: Abort status: 0x0
Jan 17 06:54:12 hualong kernel: nvme nvme0: Abort status: 0x0
I found a total of 15 QID timeouts. This suggests that nvme0 needs looking at (firmware, cabling, slot placement, cooling, or a generally failing drive).
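
To check the drive itself, I would start with its SMART data (assuming the device really is nvme0; requires the smartmontools and/or nvme-cli packages):

Code:
smartctl -a /dev/nvme0      # health status, temperature, error counters
nvme smart-log /dev/nvme0   # same data via nvme-cli, incl. media errors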

Also note that at the same time as the above, and in another instance of QID timeouts at Jan 17 00:40:29, the journal entries appear non-chronological. This may be related either to the timing problems mentioned above or to nvme0 drive writes stalling (assuming the logs are written to nvme0). Possibly this is nothing (an artifact of how journalctl collects entries?); I don't have personal experience with this.

and it has been happening more frequently lately
This should not be happening at all. For how long have you had these episodes? Since when have they become worse? Any kernel changes/updates around that time?
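
You can check the running kernel and the installed kernel packages with (the package name pattern assumes a current PVE install):

Code:
uname -r                    # currently running kernel
dpkg -l 'proxmox-kernel*'   # installed Proxmox kernel packages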

How full is the root partition? Check with:
Code:
df -h

Good luck.
 
First, I would upgrade to the latest version, 9.1.4.

Second, it sounds like the e1000 network card problem again, which has been discussed and addressed here frequently; see the workaround sketch at the end of this post.

Third, the suspicion of a full root partition, as mentioned above, isn't unreasonable either.
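
If it does turn out to be the e1000 issue, the workaround usually posted in those threads is to disable offloading on the affected NIC, roughly like this (eno1 is a placeholder, adjust to your interface name):

Code:
ethtool -K eno1 tso off gso off
# to make it persistent, add under the iface stanza in /etc/network/interfaces:
#   post-up /usr/sbin/ethtool -K eno1 tso off gso off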
 