Migrated VMs to Proxmox 9.0.6 (ZFS) and getting status io-error

highland

Hello Team,
I had several Splunk instances on Proxmox 8.3.4, all working fine.
I copied those disk images to /var/lib/vz/images on a new server running Proxmox 9.0.6,
and also copied the config files. The migration was swift: all VMs started and worked fine.
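
For anyone wondering, the copy itself was nothing fancy. With 103 as the example VMID and new-proxmox as a placeholder hostname, it boils down to something like this (paths are the stock Proxmox ones):

Code:
# on the old host: copy the disk image(s) over to the new host
rsync -a /var/lib/vz/images/103/ root@new-proxmox:/var/lib/vz/images/103/

# copy the VM config (this is where Proxmox keeps it):
scp /etc/pve/qemu-server/103.conf root@new-proxmox:/etc/pve/qemu-server/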

But after some time they started to randomly get stuck (on average once per day) with status: io-error.
They become unresponsive and I have to stop and start them again (from the CLI, using qm stop). I have spent hours with ChatGPT and cannot find the root cause.
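
For reference, the recovery dance is just this (103 as the example; I believe qm resume is also worth trying first, since the guest is only paused on the I/O error, but it only helps if whatever triggered it has cleared):

Code:
# worth trying first: un-pause the guest if the error condition is gone
qm resume 103

# otherwise the hard way:
qm stop 103
qm start 103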

I can see that status, and QEMU claims I/O status: nospace:

Code:
root@proxmox3:~# qm status 103
status: io-error


root@proxmox3:~# qm monitor 103
Entering QEMU Monitor for VM 103 - type 'help' for help
qm> info block
ide2: [not inserted]
    Attached to:      ide2
    Removable device: not locked, tray closed

drive-scsi0: json:{"throttle-group": "throttle-drive-scsi0", "driver": "throttle", "file": {"driver": "qcow2", "file": {"driver": "file", "filename": "/var/lib/vz/images/103/vm-103-disk-0.qcow2"}}} (throttle)
    Attached to:      scsi0
    I/O status:       nospace
    Cache mode:       writeback
    Detect zeroes:    unmap

But I do not understand why. There is a lot of free storage on the Proxmox host (500GB+ free at all times, no spikes).
There is also a lot of free storage inside the VM (the Splunk instance is consuming maybe 50%; df -h shows plenty of space).

I have checked all the logs I could find on Proxmox and nothing suggests any storage issue, nor any full disk.
The same inside the VM - nothing suggests it was running out of storage.
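
For anyone wanting to double-check the same things, these are roughly the commands involved (the qcow2 path is from my setup; adjust for yours):

Code:
# free space on the directory storage backing /var/lib/vz:
df -h /var/lib/vz

# the ZFS view - snapshots and refreservation can eat space that df does not show:
zpool list
zfs list -o space

# actual vs. virtual size of the image (-U is needed while the VM is running):
qemu-img info -U /var/lib/vz/images/103/vm-103-disk-0.qcow2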

Why does QEMU claim an I/O status issue and lock my VM? Any ideas? (IMHO it's somehow related to ZFS...)
Could it be that ZFS-backed storage with discard=on, plus fstrim running periodically inside the VM, is causing this?
I wouldn't think so, but maybe it's a bug?
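
For context, the discard setting is visible in the VM config; the scsi0 line below is just an illustration of the shape, not my exact config:

Code:
qm config 103 | grep scsi0
# prints something like:
# scsi0: local:103/vm-103-disk-0.qcow2,discard=on,size=100G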

Thanks,
Michal
 
Update:
OK, maybe it will help somebody:
I migrated those heavy-IO-bound VMs from qcow2 files on the local ZFS-backed directory storage to proper ZFS pool storage (zvols), and the problem disappeared. That was ChatGPT's recommendation (and what the ZFS maintainers advise as well).
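
The move itself is one command per disk; assuming the ZFS storage is named local-zfs in /etc/pve/storage.cfg, it goes something like:

Code:
# move the disk online from the directory storage to the ZFS storage
# (creates a zvol and switches the VM over to it):
qm disk move 103 scsi0 local-zfs

# add "--delete 1" to remove the old qcow2 automatically after the move:
# qm disk move 103 scsi0 local-zfs --delete 1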