Random IO Error - Windows Server 2025

Hello,

Just some feedback on this Friday. I have had absolutely no issues this week. Before the changes I made, I usually had an issue every day or two at most. So it seems that using qcow2 or raw disk images on top of ZFS is causing the issues. Because my setup was not correct, and since I am running a cluster and have backups, I moved the VMs to other nodes, reinstalled the emptied node with a data directory on ext4, then moved the VMs back, until I was done.

I did not run extensive performance tests, but in normal usage I see no difference in response time.

I will evaluate creating new nodes with ZFS and the requirements that come with it, but for now I am happy to have no issues.

Thanks to everyone that commented and helped! :)
 
Code:
zpool status
  pool: data
 state: ONLINE
config:

    NAME                                                 STATE     READ WRITE CKSUM
    data                                                 ONLINE       0     0     0
      mirror-0                                           ONLINE       0     0     0
        nvme-eui.333558304b7083240025385800000001-part5  ONLINE       0     0     0
        nvme-eui.333558304b7083450025385800000001-part5  ONLINE       0     0     0

errors: No known data errors

Code:
lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme1n1     259:0    0 894.3G  0 disk
├─nvme1n1p1 259:1    0   511M  0 part
│ └─md1       9:1    0 510.9M  0 raid1 /boot/efi
├─nvme1n1p2 259:2    0     1G  0 part
│ └─md2       9:2    0  1022M  0 raid1 /boot
├─nvme1n1p3 259:3    0    20G  0 part
│ └─md3       9:3    0    20G  0 raid1 /
├─nvme1n1p4 259:4    0     1G  0 part  [SWAP]
└─nvme1n1p5 259:5    0 871.8G  0 part
nvme0n1     259:6    0 894.3G  0 disk
├─nvme0n1p1 259:7    0   511M  0 part
│ └─md1       9:1    0 510.9M  0 raid1 /boot/efi
├─nvme0n1p2 259:8    0     1G  0 part
│ └─md2       9:2    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:9    0    20G  0 part
│ └─md3       9:3    0    20G  0 raid1 /
├─nvme0n1p4 259:10   0     1G  0 part  [SWAP]
├─nvme0n1p5 259:11   0 871.8G  0 part
└─nvme0n1p6 259:12   0     2M  0 part

Code:
mount | grep zfs
data/zd0 on /var/lib/vz type zfs (rw,relatime,xattr,posixacl,casesensitive)

Code:
df -h | grep zd0
data/zd0        838G  252G  586G  31% /var/lib/vz

Code:
pvesm status
Name         Type     Status     Total (KiB)      Used (KiB) Available (KiB)        %
local         dir     active       877855872       264064768       613791104   30.08%
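
As a quick sanity check on the numbers above, here is a small Python sketch that converts the `pvesm status` figures from KiB to GiB and recomputes the usage percentage. The values are copied from the output above; the function name `summarize` is only illustrative. It shows that the datastore is roughly 30% used, so a full filesystem does not look like the obvious explanation for a "nospace" status.

```python
KIB_PER_GIB = 1024 ** 2  # 1 GiB = 1024 * 1024 KiB


def summarize(total_kib, used_kib, avail_kib):
    """Convert pvesm-style KiB figures to GiB and compute the used percentage."""
    return {
        "total_gib": total_kib / KIB_PER_GIB,
        "used_gib": used_kib / KIB_PER_GIB,
        "avail_gib": avail_kib / KIB_PER_GIB,
        "used_pct": used_kib / total_kib * 100,
    }


# Figures taken from the "local" line of the pvesm status output above.
summary = summarize(877855872, 264064768, 613791104)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

Note that `df -h` rounds its sizes, so its figures can differ slightly from the exact GiB values computed here.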

Code:
qm config 105
agent: 1
allow-ksm: 0
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v2-AES
efidisk0: local:105/vm-105-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
memory: 4096
meta: creation-qemu=8.1.2,ctime=1705047202
name: xxxx
net0: virtio=BC:24:11:27:61:00,bridge=vmbr1,firewall=1
net1: virtio=BC:24:11:DD:46:95,bridge=vmbr2,firewall=1
numa: 0
onboot: 1
ostype: l26
rng0: source=/dev/urandom
scsi0: local:105/vm-105-disk-1.qcow2,iothread=1,size=15G
scsi1: local:105/vm-105-disk-2.qcow2,iothread=1,size=60G
scsi2: local:105/vm-105-disk-3.qcow2,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=da4d67ad-7563-41b5-81d4-d2e3d41f9318
sockets: 1
vga: virtio,memory=16
vmgenid: 58e70fb0-8fd6-4b57-ba4f-2e14c3990592
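
To see at a glance which disks of a VM are qcow2 files on a directory storage, the `qm config` output above can be parsed with a few lines of Python. This is only a sketch: the helper `list_disks` and the key pattern are my own assumptions, and the image format is guessed from the file extension of the volume name, which will not work for storages that do not use file names.

```python
import re

# Excerpt of the "qm config 105" output above.
SAMPLE_CONFIG = """\
boot: order=scsi0;ide2;net0
efidisk0: local:105/vm-105-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
scsi0: local:105/vm-105-disk-1.qcow2,iothread=1,size=15G
scsi1: local:105/vm-105-disk-2.qcow2,iothread=1,size=60G
scsi2: local:105/vm-105-disk-3.qcow2,iothread=1,size=32G
scsihw: virtio-scsi-single
"""

# Assumed set of disk-like keys; "scsihw" and "boot" must not match.
DISK_KEY = re.compile(r"^(scsi|virtio|sata|ide|efidisk|tpmstate)\d+$")


def list_disks(config_text):
    """Return (key, storage, format) tuples for every disk line in the config."""
    disks = []
    for line in config_text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep or not DISK_KEY.match(key):
            continue
        volume = value.split(",")[0]
        if volume == "none":  # empty CD-ROM drive
            continue
        storage, _, path = volume.partition(":")
        # Guess the format from the file extension; "unknown" if there is none.
        fmt = path.rsplit(".", 1)[1] if "." in path else "unknown"
        disks.append((key, storage, fmt))
    return disks


if __name__ == "__main__":
    for key, storage, fmt in list_disks(SAMPLE_CONFIG):
        print(f"{key}: storage={storage} format={fmt}")
```
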
Code:
    {
      "io-status": "nospace",
      "device": "",
      "locked": false,
      "removable": false,
      "inserted": {
        "iops_rd": 0,
        "detect_zeroes": "on",
        "active": true,
        "image": {
          "backing-image": {
            "virtual-size": 64424509440,
            "filename": "/var/lib/vz/images/105/vm-105-disk-2.qcow2",
            "cluster-size": 65536,
            "format": "qcow2",
            "actual-size": 33055646208,
            "format-specific": {
              "type": "qcow2",
              "data": {
                "compat": "1.1",
                "compression-type": "zlib",
                "lazy-refcounts": false,
                "refcount-bits": 16,
                "corrupt": false,
                "extended-l2": false
              }
            },
            "dirty-flag": false
          },
          "virtual-size": 64424509440,
          "filename": "json:{\"throttle-group\": \"throttle-drive-scsi1\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/105/vm-105-disk-2.qcow2\"}}}",
          "cluster-size": 65536,
          "format": "throttle",
          "actual-size": 33055646208,
          "dirty-flag": false
        },
        "iops_wr": 0,
        "ro": false,
        "children": [
          {
            "node-name": "f142383428b6e23abe92f5be24f64b6",
            "child": "file"
          }
        ],
        "node-name": "drive-scsi1",
        "backing_file_depth": 1,
        "drv": "throttle",
        "iops": 0,
        "bps_wr": 0,
        "write_threshold": 0,
        "encrypted": false,
        "bps": 0,
        "bps_rd": 0,
        "cache": {
          "no-flush": false,
          "direct": false,
          "writeback": true
        },
        "file": "json:{\"throttle-group\": \"throttle-drive-scsi1\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/105/vm-105-disk-2.qcow2\"}}}"
      },
      "qdev": "scsi1",
      "type": "unknown"
    },

And for the other VM:
Code:
    {
      "io-status": "nospace",
      "device": "",
      "locked": false,
      "removable": false,
      "inserted": {
        "iops_rd": 0,
        "detect_zeroes": "on",
        "active": true,
        "image": {
          "backing-image": {
            "virtual-size": 57982058496,
            "filename": "/var/lib/vz/images/100/vm-100-disk-0.qcow2",
            "cluster-size": 65536,
            "format": "qcow2",
            "actual-size": 33758224896,
            "format-specific": {
              "type": "qcow2",
              "data": {
                "compat": "1.1",
                "compression-type": "zlib",
                "lazy-refcounts": false,
                "refcount-bits": 16,
                "corrupt": false,
                "extended-l2": false
              }
            },
            "dirty-flag": false
          },
          "virtual-size": 57982058496,
          "filename": "json:{\"throttle-group\": \"throttle-drive-scsi0\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/100/vm-100-disk-0.qcow2\"}}}",
          "cluster-size": 65536,
          "format": "throttle",
          "actual-size": 33758224896,
          "dirty-flag": false
        },
        "iops_wr": 0,
        "ro": false,
        "children": [
          {
            "node-name": "f14746557d2a99c86b817f8f3ed7182",
            "child": "file"
          }
        ],
        "node-name": "drive-scsi0",
        "backing_file_depth": 1,
        "drv": "throttle",
        "iops": 0,
        "bps_wr": 0,
        "write_threshold": 0,
        "encrypted": false,
        "bps": 0,
        "bps_rd": 0,
        "cache": {
          "no-flush": false,
          "direct": false,
          "writeback": true
        },
        "file": "json:{\"throttle-group\": \"throttle-drive-scsi0\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/100/vm-100-disk-0.qcow2\"}}}"
      },
      "qdev": "scsi0",
      "type": "unknown"
    },

Code:
lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             43 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0-31
Vendor ID:                   AuthenticAMD
  Model name:                AMD EPYC 7371 16-Core Processor
    CPU family:              23
    Model:                   1
    Thread(s) per core:      2
    Core(s) per socket:      16
    Socket(s):               1
    Stepping:                2
    Frequency boost:         enabled
    CPU(s) scaling MHz:      104%
    CPU max MHz:             3100.0000
    CPU min MHz:             2500.0000
    BogoMIPS:                6188.32
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1
                             gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 ss
                             e4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw
                              skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2
                              rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_c
                             lean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:            AMD-V
Caches (sum of all):
  L1d:                       512 KiB (16 instances)
  L1i:                       1 MiB (16 instances)
  L2:                        8 MiB (16 instances)
  L3:                        64 MiB (8 instances)
NUMA:
  NUMA node(s):              4
  NUMA node0 CPU(s):         0-3,16-19
  NUMA node1 CPU(s):         4-7,20-23
  NUMA node2 CPU(s):         8-11,24-27
  NUMA node3 CPU(s):         12-15,28-31
Vulnerabilities:
  Gather data sampling:      Not affected
  Ghostwrite:                Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Not affected
  Old microcode:             Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Mitigation; untrained return thunk; SMT vulnerable
  Spec rstack overflow:      Mitigation; Safe RET
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                     Not affected
  Tsa:                       Not affected
  Tsx async abort:           Not affected
  Vmscape:                   Mitigation; IBPB before exit to userspace

For two weeks everything was fine. Then today this bug(?) hit two VMs, one of them shortly after a node reboot. Both have "io-status": "nospace". If you need more data, just tell me.
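
For anyone who wants to spot this state automatically, here is a minimal Python sketch that scans block-device entries shaped like the JSON above and reports every device whose "io-status" is not "ok". It assumes the entries have already been saved as a JSON list; the sample is trimmed to a few fields, the "ok" entry is invented for contrast, and the function name `stalled_devices` is hypothetical.

```python
import json

# Trimmed sample shaped like the output posted above.
# The second entry (io-status "ok") is invented for contrast.
SAMPLE = """
[
  {"io-status": "nospace", "device": "", "qdev": "scsi1",
   "inserted": {"node-name": "drive-scsi1", "drv": "throttle"}},
  {"io-status": "ok", "device": "", "qdev": "scsi0",
   "inserted": {"node-name": "drive-scsi0", "drv": "throttle"}}
]
"""


def stalled_devices(entries):
    """Return (qdev, io-status) pairs for entries whose status is not "ok"."""
    result = []
    for entry in entries:
        status = entry.get("io-status")
        # Entries without an "io-status" key are skipped.
        if status is not None and status != "ok":
            result.append((entry.get("qdev", ""), status))
    return result


if __name__ == "__main__":
    for qdev, status in stalled_devices(json.loads(SAMPLE)):
        print(f"{qdev}: io-status={status}")
```

A check like this could be run periodically against each VM, but how the JSON is collected on the node is left open here.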
 
This issue still persists, even after migrating from qcow2 to raw.
I am now trying
zfs set direct=disabled data
as suggested in other threads.
 
I solved the problem 100% by downgrading the drivers to version 271. Since then I have not had a single hang on the systems.
 
Following up on my earlier feedback: since the move to ext4 (early December), I have had no issues. In case that is of any use to someone: ext4 everywhere. I am not benefiting from ZFS, but I don't need it in my setup. Performance is excellent, with absolutely no problems.
 
We are seeing this problem randomly (maybe once every few months or so) on our systems as well. VM fails with `io-error` and has to be powered off.

Similar to OP: we have a ZFS directory with qcow2 images on it. Many TBs of space are available. There are no errors in journalctl, dmesg, or zpool status. After rebooting the box, it doesn't happen again until a few months pass, and then it pops up randomly for no apparent reason.

This particular instance happened when we were running Windows update.