Troubleshooting io-errors

MasterMorgan

New Member
Mar 3, 2024
I have a VM that shows the yellow warning triangle with the status: io-error. It happens at random times, roughly once every 24 hours, but with no discernible pattern. The only way to recover is to perform a hard stop of the VM. I am trying to find information on how to troubleshoot this error. According to this thread: https://forum.proxmox.com/threads/getting-io-error-on-vm.123348/, io-error means a problem with one of the VM's disks. My first question is whether someone can confirm that. Or could an io-error also be caused by a network or memory issue?

I am on pve-manager/8.3.1 and kernel 6.8.12-5-pve, but the error happens on other kernels as well. The server is a single node and the VM disks are stored on a simple btrfs RAID 0 file system across two disks. I get no errors from either smartctl or btrfs check. The number of I/O operations to disk does not seem to matter, as I can stress test the VM for hours without the error occurring. It just happens when it decides to... I have also moved the disk file to a separate drive, but it made no difference, and I have tried switching the disk back and forth between raw and qcow2 formats. There are no disk errors inside the VM itself, and other VMs with the same OS run on this host without issue.
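For reference, this is roughly the kind of checking I have been doing against the btrfs storage (device names and mount point are placeholders for my actual layout):

  smartctl -a /dev/sdX                 # SMART health on each of the two disks behind the RAID 0
  smartctl -a /dev/sdY
  btrfs device stats /mnt/pve/HDD      # per-device write/read/flush/corruption error counters
  btrfs scrub start -B /mnt/pve/HDD    # full checksum verification of data and metadata
  btrfs filesystem usage /mnt/pve/HDD  # allocation overview (btrfs can report ENOSPC with data space still "free")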

I can see from the VM summary when it freezes, but I cannot find anything in any log in the time leading up to the error. No disk errors, nothing. The only thing in the log is the host pinging the guest agent and not succeeding:
VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout

Surely there must be a log somewhere of something happening. I have used journalctl and dmesg but cannot find anything at all. I also looked at the task logs of the affected VM with 'pvenode task log', but it just shows regular start/stop/vnc events. Is it possible to turn on more logging somehow, or are there more logs that I am not aware of?
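The next time it happens I will try to inspect the paused VM before hard-stopping it, along these lines (assuming the status/monitor commands report what I expect; 102 is the VMID):

  qm status 102 --verbose   # qmpstatus should show that QEMU paused the VM and why
  qm monitor 102            # then "info status" and "info block" to see which drive reported the error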

Here is the qm config of the VM itself, with some things redacted:
agent: 1
balloon: 3072
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 4
cpu: x86-64-v2-AES
efidisk0: HDD:102/vm-102-disk-0.raw,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: q35
memory: 6144
meta: creation-qemu=8.1.5,ctime=1709494696
name: <name redacted>
net0: virtio=BC:24:11:XX:XX:XX,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: HDD:102/vm-102-disk-1.qcow2,iothread=1,size=127G
scsihw: virtio-scsi-single
smbios1: uuid=bea72a8d-726d-4575-8628-d432f72d40b4
sockets: 1
unused0: WD2:102/vm-102-disk-0.raw
vmgenid: 65bfa45d-xxxx-xxxx-xxxx-xxxxxxxxfb72

Any help greatly appreciated!
 
Thanks for replying! HDD is the RAID 0 btrfs storage. There are absolutely zero error or warning messages related to anything storage-related (HBA or disks) in either dmesg or journalctl. Other VMs are stored on the same storage and have no problems. I have run smartctl tests, btrfs scrub, etc., but found nothing.

Do you have any suggestions as to what to look for?
Can you confirm that status: io-error is always related to storage/disks?
Can logging be increased to see what is going on? Since PVE detects the error, it must do so based on _something_.

:)
 
A VM that freezes with an I/O error usually does so because it needs to write to a virtual disk (which itself should have free space) while the underlying storage has none left (for example because it is thin-provisioned and over-committed by multiple VMs). You might not see logs inside the VM because there is no way to write them. Then again, you have already moved the virtual disk to another storage, so your issue might be something else entirely.
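If you want to rule that out, compare the guest's virtual disk size against what the qcow2 actually occupies and what the filesystem really has free, for example (the path is a guess; adjust it to wherever your HDD storage is mounted):

  qemu-img info /mnt/pve/HDD/images/102/vm-102-disk-1.qcow2   # "virtual size" vs "disk size"
  btrfs filesystem usage /mnt/pve/HDD                         # real free space, data vs metadata
  df -h /mnt/pve/HDD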
 
I have looked at these:
  • journalctl --since "<1-2 hours before the VM freezes>"
    Nothing that relates to storage, or any other error.
  • dmesg
    Also nothing
  • pvenode task list --errors --vmid 102
    This list only contains qmshutdown, vncproxy, qmstart
And I agree with you; there must be some trace of what is going on somewhere. I just don't know where to look :)
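For completeness, this is roughly the filtering I used when searching (the time window is a placeholder):

  journalctl -b -p warning                                            # everything at warning level or above since boot
  journalctl --since "<window>" -u pvedaemon -u pvestatd -u pveproxy
  dmesg -T | grep -iE 'btrfs|i/o error|blk_update|ata'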
 
VM just froze again. This is the output from journalctl on the host:
Dec 10 20:46:43 host pveproxy[717028]: worker exit
Dec 10 20:46:44 host pveproxy[1534]: worker 717028 finished
Dec 10 20:46:44 host pveproxy[1534]: starting 1 worker(s)
Dec 10 20:46:44 host pveproxy[1534]: worker 743196 started
Dec 10 20:51:39 host pvedaemon[736634]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Dec 10 20:52:00 host pvedaemon[736634]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Dec 10 20:52:19 host pvedaemon[738341]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Dec 10 20:52:39 host pvedaemon[736634]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout

And from the same time inside the VM:
Dec 10 20:50:53 vm1 qemu-ga[791]: info: guest-ping called
Dec 10 20:51:04 vm1 qemu-ga[791]: info: guest-ping called
Dec 10 20:51:15 vm1 qemu-ga[791]: info: guest-ping called
Dec 10 20:51:25 vm1 qemu-ga[791]: info: guest-ping called
-- Boot 1a1ce78dbbc74f33b1d7e0d39fa215d3 --

I can't account for the time difference between the host and guest logs. The clocks seem to be in sync when I run the date command manually.
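Next time I will capture both clocks at the same moment to rule out drift, e.g. running this on the host and inside the guest simultaneously:

  date -u; timedatectl   # "System clock synchronized" should say yes on both sides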
 
