Troubleshooting io-errors

MasterMorgan · Dec 10, 2024

I have a VM that experiences the yellow triangle with the status: io-error. It happens at random times usually once every 24 hours, but with no pattern. The only way to recover is to perform a hard stop of the VM. I am trying to find information on how to troubleshoot this error. According to this thread: https://forum.proxmox.com/threads/getting-io-error-on-vm.123348/, io-error means problem with one of the VM's disks. First question is if someone can confirm that? Or could an io-error be caused by a network or memory issue?

I am on pve-manager/8.3.1 and kernel 6.8.12-5-pve, but the error happens on other kernels as well. The server is a single node and the VM disks are stored on a simple btrfs raid 0 file system across two disks. I have no errors from either smartctl or btrfs check. Number of I/O operations to disk does not seem to matter as I can stress test the VM for hours without the error occurring. It just decides to sometimes... I have also moved the disk file to another separate drive, but it made no difference. I have also tried switching back and forth between RAW and QCOW2 formats. There are no disk errors in the VM itself. Other VMs run on this host with the same OS without issue.

I can see from the VM summary when it freezes, but I cannot find anything in any log in the time leading up to the error. No disk errors, nothing. The only thing in the log is the host pinging the quest agent and not succeeding:
VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout

Surely there must be a log somewhere of something happening. I use journalctl and dmesg but cannot find anything at all. I also looked at the tasks logs of the affected VM with 'pvenode task log', but it just has regular start/stop/vnc events. Is it possible to turn on more logging somehow, or are there more logs that I am not aware of?

Here is the qm config of the VM itself, with some things redacted:
agent: 1
balloon: 3072
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 4
cpu: x86-64-v2-AES
efidisk0: HDD:102/vm-102-disk-0.raw,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: q35
memory: 6144
meta: creation-qemu=8.1.5,ctime=1709494696
name: <name redacted>
net0: virtio=BC:24:11:XX:XX:XX,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: HDD:102/vm-102-disk-1.qcow2,iothread=1,size=127G
scsihw: virtio-scsi-single
smbios1: uuid=bea72a8d-726d-4575-8628-d432f72d40b4
sockets: 1
unused0: WD2:102/vm-102-disk-0.raw
vmgenid: 65bfa45d-xxxx-xxxx-xxxx-xxxxxxxxfb72

Any help greatly appreciated!

alexskysilk · Dec 10, 2024

MasterMorgan said:
io-error means problem with one of the VM's disks.

the most obvious culprit would be whatever the underlying disk is for your store named "HDD". check dmesg for any clues.

MasterMorgan · Dec 10, 2024

Thanks for replying! HDD is the RAID 0 btrfs disks. There are absolutely zero error or warning messages related to anything storage (HBA or disks) in neither dmesg or journaltctl. Other VMs are stored on the same storage and have no problems. I have run smartctl test, btrfs scrub etc. but found nothing.

Do you have any suggestions as to what to look for?
Can you confirm that status:io-error is always related to storage/disk?
Can logging be increased to see what is going on? Since PVE detects that error it must do so based on _something_.

alexskysilk · Dec 10, 2024

the answer lies in the logs. I was reaching for the low hanging fruit but since you've ruled that out, you'd need to dig deeper.

MasterMorgan said:
I can see from the VM summary when it freezes, but I cannot find anything in any log in the time leading up to the error.

Which log?

leesteken · Dec 10, 2024

A VM that freezes with an I/O-error is usually because it needs to write to a virtual disk (which should have space) but the underlying storage has no free space (since it is sparse and over-committed by multiple VMs). You might not see logs in the VM as there is no way to write them. Then again, you already moved the virtual disk to another storage, so your issue might be completely different.

MasterMorgan · Dec 10, 2024

I have looked at these:

journalctl --since "<1-2 hours before the VM freezes>"
Nothing that relates to storage, or any other error.
dmesg
Also nothing
pvenode task list --errors --vmid 102
This list only contains qmshutdown, vncproxy, qmstart

And I agree with you; there must be some trace of what is going on somewhere. I just don't know where to look

MasterMorgan · Dec 10, 2024

VM just froze again. This is the output from journalctl on the host:
Dec 10 20:46:43 host pveproxy[717028]: worker exit
Dec 10 20:46:44 host pveproxy[1534]: worker 717028 finished
Dec 10 20:46:44 host pveproxy[1534]: starting 1 worker(s)
Dec 10 20:46:44 host pveproxy[1534]: worker 743196 started
Dec 10 20:51:39 host pvedaemon[736634]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Dec 10 20:52:00 host pvedaemon[736634]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Dec 10 20:52:19 host pvedaemon[738341]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Dec 10 20:52:39 host pvedaemon[736634]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout

And from the same time inside the VM:
Dec 10 20:50:53 vm1 qemu-ga[791]: info: guest-ping called
Dec 10 20:51:04 vm1 qemu-ga[791]: info: guest-ping called
Dec 10 20:51:15 vm1 qemu-ga[791]: info: guest-ping called
Dec 10 20:51:25 vm1 qemu-ga[791]: info: guest-ping called
-- Boot 1a1ce78dbbc74f33b1d7e0d39fa215d3 --

I can't account for the difference in time. The clocks seem to be in sync when I execute the date command manually.

MasterMorgan · Jan 19, 2025

Based on findings by @fossaaen in this post. I believe I have found a solution.

Apparently, when a system reaches 80 % memory utilization a feature called Kernel Samepage Merging (KSM) kicks off. It scans all the pages in memory and merges them when it finds duplicates. On my system this saved around 3.4 GB of memory. Since the error in my case was called 'io-error' I was looking at network and storage for a cause, but is seems it might be KSM.

I disabled KSM on the system and the VM has since been running stable. My theory is that as long as KSM was not active (memory usage below 80 %) everything is fine, but once KSM becomes active it is only a matter of time until the VM hangs. The VM I am struggling with is the one with most heavy memory usage. I have no root cause, just the empirical observation that turning KSM off seems to solve it.

I am still quite surprised that there is no log available to troubleshoot such systems.

Search

Search

Troubleshooting io-errors

MasterMorgan

New Member

alexskysilk

Distinguished Member

MasterMorgan

New Member

alexskysilk

Distinguished Member

leesteken

Distinguished Member

MasterMorgan

New Member

MasterMorgan

New Member

MasterMorgan

New Member

We value your privacy