Strange Disk IO

Dunuin

Hi,

I was just looking at what the backup is doing and I saw a lot of disk IO in every running VM.
11 VMs were running on that server, normally only doing 10 KB/s to 500 KB/s of disk IO each. But from around 22:00 to 5:30 (that's when the backup starts and shuts down all the VMs) all 11 VMs were reading/writing like crazy.

Here are some examples:
[Attached screenshots: diskio1.png, diskio2.png]

Any idea what could have caused this?

I thought it might be a snapshot, backup or scrub, but none of those run at that time:
Code:
0 8 * * 0 /usr/sbin/zpool scrub VMpool7 > /dev/null
0 5 * * 0,3-6 /usr/local/bin/cv4pve-autosnap --host=127.0.0.1 --username=snapshot@pam --password='mypass' --vmid=all,-100,-125 --timeout=1800 snap --label='daily' --keep=12 > /var/log/cv4pve/cv4pve.log 2>&1
45 4 * * * /usr/local/bin/cv4pve-autosnap app-upgrade > /var/log/cv4pve/cv4pve.log 2>&1

And daily backups are started at 5:30.
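
To rule those out, the timestamps can also be checked directly on the pool. A quick sanity check (just a sketch, reusing the pool name and the 'daily' snapshot label from the crontab above):
Code:
# Shows when the last scrub of VMpool7 started and finished
zpool status VMpool7

# Lists snapshots sorted by creation time, to see if any were created in the 22:00-05:30 window
zfs list -t snapshot -o name,creation -s creation | grep daily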

So I really have no idea what happened there, especially because it happened with every VM (Linux, FreeBSD and Win10).

All VMs are stored on a raidz1 of 5 SSDs.
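
If it happens again, watching the pool and the host processes live should show where the IO is actually going. A minimal sketch (iotop is not installed by default and would need "apt install iotop" first):
Code:
# Per-vdev read/write throughput of VMpool7, refreshed every 5 seconds
zpool iostat -v VMpool7 5

# Accumulated per-process IO on the host, to see which kvm process (= which VM) is busy
iotop -oPa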

Edit:
One of my VMs has crashed and it looks like one of the USB HDDs passed through with "qm set" isn't working anymore. In the host's syslog I see a lot of messages like this:
Code:
Jun 27 00:14:28 Hypervisor kernel: [200837.404688] sd 14:0:0:0: [sdo] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 27 00:14:28 Hypervisor kernel: [200837.404707] sd 14:0:0:0: [sdo] tag#0 Sense Key : Not Ready [current]
Jun 27 00:14:28 Hypervisor kernel: [200837.404709] sd 14:0:0:0: [sdo] tag#0 Add. Sense: Logical unit is in process of becoming ready
Jun 27 00:14:28 Hypervisor kernel: [200837.404711] sd 14:0:0:0: [sdo] tag#0 CDB: Read(10) 28 00 20 6b 37 1a 00 00 01 00
Jun 27 00:14:28 Hypervisor kernel: [200837.404713] blk_update_request: I/O error, dev sdo, sector 4351178960 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 27 00:14:34 Hypervisor pvestatd[4118]: VM 125 qmp command failed - VM 125 qmp command 'query-proxmox-support' failed - got timeout
Jun 27 00:14:35 Hypervisor pvestatd[4118]: status update time (9.118 seconds)
Jun 27 00:14:42 Hypervisor pvestatd[4118]: VM 125 qmp command failed - VM 125 qmp command 'query-proxmox-support' failed - unable to connect to VM 125 qmp socket - timeout after 31 retries
Jun 27 00:14:43 Hypervisor pvestatd[4118]: status update time (6.934 seconds)
Jun 27 00:14:52 Hypervisor pvestatd[4118]: VM 125 qmp command failed - VM 125 qmp command 'query-proxmox-support' failed - unable to connect to VM 125 qmp socket - timeout after 31 retries
Jun 27 00:14:53 Hypervisor pvestatd[4118]: status update time (6.958 seconds)
Jun 27 00:15:00 Hypervisor systemd[1]: Starting Proxmox VE replication runner...
Jun 27 00:15:01 Hypervisor CRON[29715]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 27 00:15:01 Hypervisor systemd[1]: pvesr.service: Succeeded.
Jun 27 00:15:01 Hypervisor systemd[1]: Started Proxmox VE replication runner.
Jun 27 00:15:02 Hypervisor pvestatd[4118]: VM 125 qmp command failed - VM 125 qmp command 'query-proxmox-support' failed - unable to connect to VM 125 qmp socket - timeout after 31 retries
Jun 27 00:15:03 Hypervisor pvestatd[4118]: status update time (6.951 seconds)
Jun 27 00:15:12 Hypervisor pvestatd[4118]: VM 125 qmp command failed - VM 125 qmp command 'query-proxmox-support' failed - unable to connect to VM 125 qmp socket - timeout after 31 retries
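
For reference, the USB HDD was passed through to the VM as a raw disk, roughly like this (the bus slot and the by-id path below are only placeholders, not the real ones):
Code:
# Hypothetical example of how such a raw disk passthrough is set up
qm set 125 -scsi1 /dev/disk/by-id/usb-SOME_DISK_ID

# Which by-id link currently points to the failing sdo
ls -l /dev/disk/by-id/ | grep sdo

# Which VM configs reference a passed-through disk
grep -r "/dev/disk/by-id" /etc/pve/qemu-server/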

But why is this causing so much disk IO for all 10 other VMs on my ZFS pool? The failed disk isn't even using ZFS.
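
In case the disk itself is dying and not just the USB bridge, SMART might show it (assuming smartmontools is installed and the USB-SATA bridge passes SMART through at all):
Code:
smartctl -a /dev/sdo

# Some USB bridges need the SAT passthrough type forced
smartctl -d sat -a /dev/sdo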
 