[SOLVED] Host freeze input/output error

jambove

New Member
Feb 17, 2021
I installed Proxmox 6.3 on an Intel NUC10I7FNH (64 GB RAM, 1 TB SSD) a week ago, added a Debian VM, and it worked fine. Three days ago I added another Debian VM and then a Windows 10 VM. Ever since, the guests have been halting and the host has been returning an Input/output error for every single command until it freezes completely.

I suspected the only SSD attached (a Kingston), so I ran a badblocks test, a smartctl test, a dd read test, and even a memtest. The host passed each of them multiple times. I managed to grab a dmesg log, which I am attaching now.

Any help is appreciated

EDIT: I monitored the host with htop, and cpu usage was above 100% many times by the process kvm running the Win10 VM
 

Attachments

  • dmesg.log (28.3 KB)
  • vmconfs.txt (1.1 KB)
  • dd.log (197 bytes)
  • badblocks.log (218 bytes)
  • smartctl.log (2.5 KB)
  • pveversion.txt (58 bytes)
  • uname.txt (94 bytes)
  • cpu.txt (15.3 KB)
  • lsblk.txt (1.1 KB)
The line "nvme nvme0: I/O 503 QID 5 timeout, aborting" clearly indicates to me that the NVMe SSD is experiencing some trouble. badblocks and similar tools are not perfect representations of a VM workload, so they might not trigger the fault, and the kernel logs leave very little room for interpretation.

Maybe try a different slot, or update your BIOS, or even the drive's firmware. Also check for any misconfigured PCIe settings in the BIOS setup. Otherwise I'd say faulty hardware.
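
If it helps anyone landing here later, a saved dmesg log can be scanned for that kind of message with a few lines of Python. This is only a rough sketch: the pattern is modelled on the single line quoted above, the function name and the `io_id` label are made up for illustration, and other kernel versions may word the message differently.

```python
import re

# Pattern modelled on the one line quoted in this thread:
#   nvme nvme0: I/O 503 QID 5 timeout, aborting
# Assumption: other kernels may phrase the message differently.
NVME_TIMEOUT = re.compile(
    r"nvme (?P<dev>nvme\d+): I/O (?P<io_id>\d+) QID (?P<qid>\d+) timeout"
)


def find_nvme_timeouts(dmesg_text):
    """Return a (device, io_id, qid) tuple for every NVMe I/O timeout line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = NVME_TIMEOUT.search(line)
        if match:
            hits.append(
                (match.group("dev"), int(match.group("io_id")), int(match.group("qid")))
            )
    return hits


if __name__ == "__main__":
    # The only sample used here is the line quoted in this thread.
    sample = "nvme nvme0: I/O 503 QID 5 timeout, aborting"
    print(find_nvme_timeouts(sample))
```

To check a real log, read the file contents into a string and pass it to `find_nvme_timeouts`. An empty result does not prove the drive is healthy; it only means this particular message was not found.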

jambove said: "EDIT: I monitored the host with htop, and cpu usage was above 100% many times by the process kvm running the Win10 VM"
The percentage is a total over all assigned cores, so if your VM is configured with 4 cores, then the theoretical maximum usage stat would be 400%. As to why it spikes up, Windows often does updates or Windows Defender scans in the background, which may use quite a bit of CPU.
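
The arithmetic behind that can be written out as a small Python sketch. The function names are made up for illustration, and the 250% reading is only an example value, not something taken from the attached logs.

```python
def max_cpu_percent(cores):
    """Theoretical maximum of the summed CPU figure for a VM with `cores` cores."""
    return 100 * cores


def per_core_average(reported_percent, cores):
    """Average load per assigned core for a given summed CPU figure."""
    return reported_percent / cores


if __name__ == "__main__":
    cores = 4  # the 4-core example from the post above
    print(max_cpu_percent(cores))        # 400
    print(per_core_average(250, cores))  # 62.5 (250% is a hypothetical reading)
```

So a kvm process showing more than 100% is not overloaded by itself; what matters is how close the figure gets to 100 times the number of assigned cores.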
 
I appreciate the reply, Stefan. It turned out to be a faulty SSD, so it was a hardware problem. I got a new SSD and all is good for now :)