I have a VM that intermittently shows the yellow warning triangle with the status io-error. It happens at random, usually about once every 24 hours, but with no discernible pattern, and the only way to recover is to perform a hard stop of the VM. I am trying to find information on how to troubleshoot this error. According to this thread: https://forum.proxmox.com/threads/getting-io-error-on-vm.123348/, io-error indicates a problem with one of the VM's disks. First question: can someone confirm that? Or could an io-error also be caused by a network or memory issue?
I am on pve-manager/8.3.1 and kernel 6.8.12-5-pve, but the error happens on other kernels as well. The server is a single node and the VM disks are stored on a simple btrfs RAID 0 filesystem across two disks. I get no errors from either smartctl or btrfs check. The number of I/O operations to disk does not seem to matter, as I can stress test the VM for hours without triggering the error; it just happens whenever it decides to. I have also moved the disk file to another, separate drive, but it made no difference, and I have tried switching back and forth between RAW and QCOW2 formats. There are no disk errors inside the VM itself, and other VMs with the same OS run on this host without issue.
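For reference, these are roughly the checks I ran on the host; the device name below is a placeholder rather than my actual one:

smartctl -a /dev/sdX              # SMART attributes and error log, once per disk
btrfs check --readonly /dev/sdX   # run against the unmounted filesystem

Both come back clean.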
I can see from the VM summary when it freezes, but I cannot find anything in any log in the time leading up to the error. No disk errors, nothing. The only thing in the log is the host pinging the guest agent and getting no response:
VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
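When it happens, this is roughly how I check the state from the host:

qm status 102 --verbose   # verbose output includes the qmp status, which is where io-error shows up
qm monitor 102            # then "info status" at the monitor prompt reports whether and why the VM is paused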
Surely there must be a log somewhere of something happening. I have used journalctl and dmesg but cannot find anything at all. I also looked at the task logs of the affected VM with 'pvenode task log', but they only contain regular start/stop/vnc events. Is it possible to turn on more logging somehow, or are there more logs that I am not aware of?
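To be concrete, this is the sort of thing I have been running, without finding anything relevant (the UPID below is a placeholder, not a real one):

journalctl -b --since "1 hour ago"   # host journal around the time of a freeze
dmesg -T                             # kernel ring buffer with readable timestamps
pvenode task list                    # recent tasks on this node
pvenode task log UPID:<redacted>     # log of a specific task for VM 102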
Here is the qm config of the VM itself, with some things redacted:
agent: 1
balloon: 3072
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 4
cpu: x86-64-v2-AES
efidisk0: HDD:102/vm-102-disk-0.raw,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: q35
memory: 6144
meta: creation-qemu=8.1.5,ctime=1709494696
name: <name redacted>
net0: virtio=BC:24:11:XX:XX:XX,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: HDD:102/vm-102-disk-1.qcow2,iothread=1,size=127G
scsihw: virtio-scsi-single
smbios1: uuid=bea72a8d-726d-4575-8628-d432f72d40b4
sockets: 1
unused0: WD2:102/vm-102-disk-0.raw
vmgenid: 65bfa45d-xxxx-xxxx-xxxx-xxxxxxxxfb72
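One thing I have wondered about but not tried yet: if I read the qm.conf man page correctly, the werror and rerror drive options control what QEMU does when an I/O error occurs, so something like the line below (untested, option values taken from the docs as I understand them) might at least surface the error inside the guest instead of pausing the VM:

qm set 102 --scsi0 HDD:102/vm-102-disk-1.qcow2,iothread=1,werror=report,rerror=report

But I would rather understand where the error is coming from in the first place.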
Any help greatly appreciated!