[SOLVED] Host freeze input/output error

jambove

New Member
Feb 17, 2021
I installed Proxmox 6.3 on an Intel NUC10I7FNH (64 GB RAM, 1 TB SSD) a week ago, added a Debian VM, and it had been working fine. Three days ago I added another Debian VM and then a Windows 10 VM. Ever since, the guests have been halting and the host has been throwing Input/Output errors on every single command until it freezes completely.

I suspected the only SSD attached (a Kingston), so I ran a badblocks test, a smartctl test, a dd read test, and even a memtest. The host passed all of them multiple times. I managed to grab a dmesg log, which I am attaching now.

Any help is appreciated

EDIT: I monitored the host with htop, and CPU usage by the kvm process running the Win10 VM was above 100% many times.
 

The line "nvme nvme0: I/O 503 QID 5 timeout, aborting" clearly indicates to me that the NVMe SSD is experiencing some trouble. badblocks and similar tools are not perfect representations of a VM workload, so they might not trigger the fault, and the kernel logs leave very little room for interpretation.

Maybe try a different slot, or update your BIOS, or even the drive's firmware. Also check for any misconfigured PCIe settings in the BIOS setup. Otherwise I'd say faulty hardware.
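
For anyone who wants to check their own saved dmesg output for the same symptom, here is a minimal sketch that counts NVMe timeout lines per controller. The pattern is modeled only on the single line quoted above, so other kernel versions may word the message differently; the function name count_nvme_timeouts is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one line quoted in this thread:
#   nvme nvme0: I/O 503 QID 5 timeout, aborting
# Assumption: other kernels may phrase the message differently.
NVME_TIMEOUT = re.compile(r"nvme (nvme\d+): I/O (\d+) QID (\d+) timeout")


def count_nvme_timeouts(log_text):
    """Return a Counter mapping controller name -> number of timeout lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NVME_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # The sample is the exact line from this thread, nothing more.
    sample = "nvme nvme0: I/O 503 QID 5 timeout, aborting\n"
    print(count_nvme_timeouts(sample))
```

Any non-zero count for a controller is worth taking seriously, even if badblocks or smartctl report no problems.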

jambove said: "I monitored the host with htop, and cpu usage was above 100% many times by the process kvm running the Win10 VM"
The percentage is a total over all assigned cores, so if your VM is configured with 4 cores, then the theoretical maximum usage stat would be 400%. As to why it spikes up, Windows often does updates or Windows Defender scans in the background, which may use quite a bit of CPU.
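
To make the arithmetic concrete, here is a small sketch that converts a summed-over-cores reading into an average per assigned core. The function names and the 200% reading are hypothetical; only the 4-core, 400% case comes from the explanation above.

```python
def max_percent(cores):
    """Theoretical maximum of a CPU% figure that is summed over all cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return 100 * cores


def per_core_percent(total_percent, cores):
    """Average load per assigned core for a summed CPU% reading."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_percent / cores


if __name__ == "__main__":
    cores = 4                      # VM configured with 4 cores, as in the example above
    reading = 200.0                # hypothetical reading for the kvm process
    print(max_percent(cores))      # ceiling for this VM
    print(per_core_percent(reading, cores))
```

So a reading above 100% only means the VM is using more than one core's worth of CPU, not that anything is overloaded.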
 
I appreciate the reply, Stefan. It turned out to be a faulty SSD, so a hardware problem. I got a new SSD and all is good for now :)