Random IO Error - Windows Server 2025

Hello,

Just some feedback this Friday: I have had absolutely no issues this week. Before the changes I made, I usually hit an issue every day or two. So it seems that using qcow2 or raw disk images on top of ZFS was causing the problems. Since my setup was not correct, and since I am running a cluster and have backups, I moved the VMs to other nodes, reinstalled the emptied node with a data directory on ext4, then moved the VMs back, repeating this node by node until I was done.

I did not run extensive performance tests, but in normal usage I see no difference in response time.

I will evaluate creating new nodes with ZFS and the requirements that come with it, but for now I am happy to have no issues.

Thanks to everyone who commented and helped! :)
 
Code:
zpool status
  pool: data
 state: ONLINE
config:

    NAME                                                 STATE     READ WRITE CKSUM
    data                                                 ONLINE       0     0     0
      mirror-0                                           ONLINE       0     0     0
        nvme-eui.333558304b7083240025385800000001-part5  ONLINE       0     0     0
        nvme-eui.333558304b7083450025385800000001-part5  ONLINE       0     0     0

errors: No known data errors

Code:
lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme1n1     259:0    0 894.3G  0 disk
├─nvme1n1p1 259:1    0   511M  0 part
│ └─md1       9:1    0 510.9M  0 raid1 /boot/efi
├─nvme1n1p2 259:2    0     1G  0 part
│ └─md2       9:2    0  1022M  0 raid1 /boot
├─nvme1n1p3 259:3    0    20G  0 part
│ └─md3       9:3    0    20G  0 raid1 /
├─nvme1n1p4 259:4    0     1G  0 part  [SWAP]
└─nvme1n1p5 259:5    0 871.8G  0 part
nvme0n1     259:6    0 894.3G  0 disk
├─nvme0n1p1 259:7    0   511M  0 part
│ └─md1       9:1    0 510.9M  0 raid1 /boot/efi
├─nvme0n1p2 259:8    0     1G  0 part
│ └─md2       9:2    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:9    0    20G  0 part
│ └─md3       9:3    0    20G  0 raid1 /
├─nvme0n1p4 259:10   0     1G  0 part  [SWAP]
├─nvme0n1p5 259:11   0 871.8G  0 part
└─nvme0n1p6 259:12   0     2M  0 part

Code:
mount | grep zfs
data/zd0 on /var/lib/vz type zfs (rw,relatime,xattr,posixacl,casesensitive)

Code:
df -h | grep zd0
data/zd0        838G  252G  586G  31% /var/lib/vz

Code:
pvesm status
Name         Type     Status     Total (KiB)      Used (KiB) Available (KiB)        %
local         dir     active       877855872       264064768       613791104   30.08%
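As a quick cross-check of the numbers above, here is a minimal sketch that converts the KiB columns from `pvesm status` into GiB so they can be compared with the `df -h` line. The values are copied from the output above; the helper name `kib_to_gib` is made up for illustration.

```python
KIB_PER_GIB = 1024 * 1024  # 1 GiB = 1024 * 1024 KiB


def kib_to_gib(kib):
    """Convert a size in KiB (as printed by pvesm status) to GiB."""
    return kib / KIB_PER_GIB


# Values copied from the "local" line of pvesm status above
total_kib, used_kib, avail_kib = 877855872, 264064768, 613791104

total_gib = kib_to_gib(total_kib)
used_gib = kib_to_gib(used_kib)
avail_gib = kib_to_gib(avail_kib)
used_pct = 100 * used_kib / total_kib

print(f"total {total_gib:.1f} GiB, used {used_gib:.1f} GiB, "
      f"available {avail_gib:.1f} GiB, {used_pct:.2f}% used")
```

Both tools agree that the storage is only about 30% used, with roughly 585 GiB still available on /var/lib/vz.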

Code:
qm config 105
agent: 1
allow-ksm: 0
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v2-AES
efidisk0: local:105/vm-105-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
memory: 4096
meta: creation-qemu=8.1.2,ctime=1705047202
name: xxxx
net0: virtio=BC:24:11:27:61:00,bridge=vmbr1,firewall=1
net1: virtio=BC:24:11:DD:46:95,bridge=vmbr2,firewall=1
numa: 0
onboot: 1
ostype: l26
rng0: source=/dev/urandom
scsi0: local:105/vm-105-disk-1.qcow2,iothread=1,size=15G
scsi1: local:105/vm-105-disk-2.qcow2,iothread=1,size=60G
scsi2: local:105/vm-105-disk-3.qcow2,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=da4d67ad-7563-41b5-81d4-d2e3d41f9318
sockets: 1
vga: virtio,memory=16
vmgenid: 58e70fb0-8fd6-4b57-ba4f-2e14c3990592

Code:
    {
      "io-status": "nospace",
      "device": "",
      "locked": false,
      "removable": false,
      "inserted": {
        "iops_rd": 0,
        "detect_zeroes": "on",
        "active": true,
        "image": {
          "backing-image": {
            "virtual-size": 64424509440,
            "filename": "/var/lib/vz/images/105/vm-105-disk-2.qcow2",
            "cluster-size": 65536,
            "format": "qcow2",
            "actual-size": 33055646208,
            "format-specific": {
              "type": "qcow2",
              "data": {
                "compat": "1.1",
                "compression-type": "zlib",
                "lazy-refcounts": false,
                "refcount-bits": 16,
                "corrupt": false,
                "extended-l2": false
              }
            },
            "dirty-flag": false
          },
          "virtual-size": 64424509440,
          "filename": "json:{\"throttle-group\": \"throttle-drive-scsi1\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/105/vm-105-disk-2.qcow2\"}}}",
          "cluster-size": 65536,
          "format": "throttle",
          "actual-size": 33055646208,
          "dirty-flag": false
        },
        "iops_wr": 0,
        "ro": false,
        "children": [
          {
            "node-name": "f142383428b6e23abe92f5be24f64b6",
            "child": "file"
          }
        ],
        "node-name": "drive-scsi1",
        "backing_file_depth": 1,
        "drv": "throttle",
        "iops": 0,
        "bps_wr": 0,
        "write_threshold": 0,
        "encrypted": false,
        "bps": 0,
        "bps_rd": 0,
        "cache": {
          "no-flush": false,
          "direct": false,
          "writeback": true
        },
        "file": "json:{\"throttle-group\": \"throttle-drive-scsi1\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/105/vm-105-disk-2.qcow2\"}}}"
      },
      "qdev": "scsi1",
      "type": "unknown"
    },

And for the other VM:
Code:
    {
      "io-status": "nospace",
      "device": "",
      "locked": false,
      "removable": false,
      "inserted": {
        "iops_rd": 0,
        "detect_zeroes": "on",
        "active": true,
        "image": {
          "backing-image": {
            "virtual-size": 57982058496,
            "filename": "/var/lib/vz/images/100/vm-100-disk-0.qcow2",
            "cluster-size": 65536,
            "format": "qcow2",
            "actual-size": 33758224896,
            "format-specific": {
              "type": "qcow2",
              "data": {
                "compat": "1.1",
                "compression-type": "zlib",
                "lazy-refcounts": false,
                "refcount-bits": 16,
                "corrupt": false,
                "extended-l2": false
              }
            },
            "dirty-flag": false
          },
          "virtual-size": 57982058496,
          "filename": "json:{\"throttle-group\": \"throttle-drive-scsi0\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/100/vm-100-disk-0.qcow2\"}}}",
          "cluster-size": 65536,
          "format": "throttle",
          "actual-size": 33758224896,
          "dirty-flag": false
        },
        "iops_wr": 0,
        "ro": false,
        "children": [
          {
            "node-name": "f14746557d2a99c86b817f8f3ed7182",
            "child": "file"
          }
        ],
        "node-name": "drive-scsi0",
        "backing_file_depth": 1,
        "drv": "throttle",
        "iops": 0,
        "bps_wr": 0,
        "write_threshold": 0,
        "encrypted": false,
        "bps": 0,
        "bps_rd": 0,
        "cache": {
          "no-flush": false,
          "direct": false,
          "writeback": true
        },
        "file": "json:{\"throttle-group\": \"throttle-drive-scsi0\", \"driver\": \"throttle\", \"file\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/var/lib/vz/images/100/vm-100-disk-0.qcow2\"}}}"
      },
      "qdev": "scsi0",
      "type": "unknown"
    },

Code:
lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             43 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0-31
Vendor ID:                   AuthenticAMD
  Model name:                AMD EPYC 7371 16-Core Processor
    CPU family:              23
    Model:                   1
    Thread(s) per core:      2
    Core(s) per socket:      16
    Socket(s):               1
    Stepping:                2
    Frequency boost:         enabled
    CPU(s) scaling MHz:      104%
    CPU max MHz:             3100.0000
    CPU min MHz:             2500.0000
    BogoMIPS:                6188.32
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1
                             gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 ss
                             e4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw
                              skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2
                              rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_c
                             lean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:            AMD-V
Caches (sum of all):
  L1d:                       512 KiB (16 instances)
  L1i:                       1 MiB (16 instances)
  L2:                        8 MiB (16 instances)
  L3:                        64 MiB (8 instances)
NUMA:
  NUMA node(s):              4
  NUMA node0 CPU(s):         0-3,16-19
  NUMA node1 CPU(s):         4-7,20-23
  NUMA node2 CPU(s):         8-11,24-27
  NUMA node3 CPU(s):         12-15,28-31
Vulnerabilities:
  Gather data sampling:      Not affected
  Ghostwrite:                Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Not affected
  Old microcode:             Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Mitigation; untrained return thunk; SMT vulnerable
  Spec rstack overflow:      Mitigation; Safe RET
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                     Not affected
  Tsa:                       Not affected
  Tsx async abort:           Not affected
  Vmscape:                   Mitigation; IBPB before exit to userspace

For two weeks everything was fine, and today this bug(?) hit two VMs, one of them shortly after a node reboot. Both have "io-status": "nospace". If you need more data, just tell me.
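In case it helps anyone checking many drives at once, here is a minimal sketch that scans block-device entries like the ones above and lists every drive whose "io-status" is not "ok". It assumes the JSON is available as a list of such entries (the shape used by QEMU's `query-block`); the function name `find_io_errors` and the second sample entry are made up for illustration, and the first sample entry is trimmed down from the VM 105 output above.

```python
import json


def find_io_errors(blocks):
    """Return (qdev, io_status, filename) for each drive whose io-status is not 'ok'."""
    problems = []
    for blk in blocks:
        status = blk.get("io-status")
        # Entries without an io-status field, or with "ok", are not reported
        if status is None or status == "ok":
            continue
        image = blk.get("inserted", {}).get("image", {})
        # Prefer the plain path of the backing image over the json:{...} wrapper name
        filename = image.get("backing-image", {}).get("filename") or image.get("filename")
        problems.append((blk.get("qdev"), status, filename))
    return problems


# Trimmed sample: first entry from the post, second entry is hypothetical
sample = json.loads("""
[
  {
    "io-status": "nospace",
    "qdev": "scsi1",
    "inserted": {
      "image": {
        "backing-image": {
          "filename": "/var/lib/vz/images/105/vm-105-disk-2.qcow2"
        }
      }
    }
  },
  {
    "io-status": "ok",
    "qdev": "scsi0",
    "inserted": {"image": {"filename": "/example/healthy-disk.qcow2"}}
  }
]
""")

for qdev, status, filename in find_io_errors(sample):
    print(f"{qdev}: io-status={status} image={filename}")
```

With the sample above, only scsi1 is reported, together with its qcow2 path.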