VM instance IO hangs

Davidrama

Oct 15, 2018
Hello,

I'm running Proxmox VE 5.2 with GPU passthrough (2x 1080) for ML and am facing a strange issue.
After a certain period of time, IO freezes on all my instances: I can still connect to them, but no jobs run anymore.
I have only 2 instances: one running FreeNAS and the other dedicated to machine learning (using passthrough).

I need to restart my VMs to get things back to normal.
Any ideas?

proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-26
pve-kernel-4.4.19-1-pve: 4.4.19-66
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
 
After a certain period of time all my instances get their IO frozen
What does your storage config look like? Is it full, maybe? Does 'qm status ID --verbose' say anything interesting? Are there any logs inside the VM?
 
Hi,

The storage has 8 disks:
- 1 SSD for the GPU machine
- The rest are used by the FreeNAS VM as ZFS pools inside the VM, which is why they show as nearly full (from 98% to 100%)


[Attached screenshot: Capture d’écran 2018-10-17 à 09.32.45.png]



First VM (FreeNAS):
##########
root@prox4:~# qm status 100 --verbose
balloon: 33554432000
ballooninfo:
max_mem: 33554432000
actual: 33554432000
blockstat:
virtio5:
flush_operations: 740
failed_wr_operations: 0
timed_stats:
wr_highest_offset: 5798205652992
rd_total_time_ns: 407358124767
wr_merged: 1068
invalid_rd_operations: 0
failed_flush_operations: 0
rd_bytes: 1214873088
wr_operations: 221094
rd_merged: 1092
invalid_flush_operations: 0
wr_total_time_ns: 139040025286
invalid_wr_operations: 0
failed_rd_operations: 0
rd_operations: 28599
idle_time_ns: 106753367104209
flush_total_time_ns: 195889986931
wr_bytes: 2087962624
virtio2:
wr_highest_offset: 5798205652992
timed_stats:
flush_operations: 740
failed_wr_operations: 0
failed_flush_operations: 0
invalid_rd_operations: 0
rd_total_time_ns: 405002447458
wr_merged: 1079
rd_bytes: 1211065856
wr_operations: 219610
rd_merged: 1094
idle_time_ns: 109395010704041
failed_rd_operations: 0
rd_operations: 28581
wr_bytes: 2088062976
flush_total_time_ns: 197356784284
invalid_wr_operations: 0
invalid_flush_operations: 0
wr_total_time_ns: 145658953380
virtio1:
timed_stats:
wr_highest_offset: 5798205652992
flush_operations: 740
failed_wr_operations: 0
invalid_rd_operations: 0
failed_flush_operations: 0
rd_total_time_ns: 419883115952
wr_merged: 1077
rd_bytes: 1209669632
rd_merged: 1107
wr_operations: 219179
failed_rd_operations: 0
rd_operations: 28591
idle_time_ns: 150569857292777
flush_total_time_ns: 201091592253
wr_bytes: 2086914560
wr_total_time_ns: 159067651269
invalid_flush_operations: 0
invalid_wr_operations: 0
virtio4:
rd_merged: 1093
wr_operations: 219645
rd_bytes: 1217904640
flush_total_time_ns: 196624596430
wr_bytes: 2088122880
rd_operations: 28769
failed_rd_operations: 0
idle_time_ns: 109395220995365
wr_total_time_ns: 157425359160
invalid_flush_operations: 0
invalid_wr_operations: 0
timed_stats:
wr_highest_offset: 5798205652992
flush_operations: 740
failed_wr_operations: 0
invalid_rd_operations: 0
failed_flush_operations: 0
wr_merged: 1072
rd_total_time_ns: 416213022737
virtio3:
invalid_wr_operations: 0
invalid_flush_operations: 0
wr_total_time_ns: 139650704978
idle_time_ns: 106753363498497
failed_rd_operations: 0
rd_operations: 28737
wr_bytes: 2087859712
flush_total_time_ns: 197374774789
rd_bytes: 1218160640
wr_operations: 222130
rd_merged: 1057
rd_total_time_ns: 411224644333
wr_merged: 1074
failed_flush_operations: 0
invalid_rd_operations: 0
failed_wr_operations: 0
flush_operations: 740
wr_highest_offset: 5798205652992
timed_stats:
ide0:
failed_flush_operations: 0
invalid_rd_operations: 0
wr_merged: 0
rd_total_time_ns: 28631212465
wr_highest_offset: 107373940736
timed_stats:
failed_wr_operations: 0
flush_operations: 5283
wr_bytes: 3788649472
flush_total_time_ns: 1093989818067
idle_time_ns: 21115508450
rd_operations: 100656
failed_rd_operations: 0
invalid_wr_operations: 0
invalid_flush_operations: 0
wr_total_time_ns: 73854147055
wr_operations: 328294
rd_merged: 0
rd_bytes: 170229760
virtio6:
failed_wr_operations: 0
flush_operations: 0
wr_highest_offset: 0
timed_stats:
rd_total_time_ns: 22138970
wr_merged: 0
failed_flush_operations: 0
invalid_rd_operations: 0
rd_bytes: 1525760
wr_operations: 0
rd_merged: 0
invalid_wr_operations: 0
wr_total_time_ns: 0
invalid_flush_operations: 0
idle_time_ns: 156627569771034
rd_operations: 236
failed_rd_operations: 0
wr_bytes: 0
flush_total_time_ns: 0
cpus: 2
disk: 0
diskread: 6243429376
diskwrite: 14227572224
maxdisk: 107374182400
maxmem: 33554432000
mem: 31966096986
name: labnas
netin: 615260151
netout: 6359577217
nics:
tap100i0:
netout: 6359577217
netin: 615260151
pid: 25535
qmpstatus: running
status: running
template:
uptime: 156708
vmid: 100

######
Second VM (GPU)
######

root@prox4:~# qm status 103 --verbose
blockstat:
virtio3:
rd_bytes: 273832389120
timed_stats:
wr_total_time_ns: 27376269100931
rd_operations: 5123337
idle_time_ns: 6427725922587
invalid_wr_operations: 0
flush_total_time_ns: 33268640774
wr_merged: 193864
rd_total_time_ns: 2651625666858
invalid_flush_operations: 0
wr_bytes: 149046317568
flush_operations: 31231
rd_merged: 727963
invalid_rd_operations: 0
wr_operations: 1607039
failed_rd_operations: 0
failed_flush_operations: 0
failed_wr_operations: 0
wr_highest_offset: 2267905425408
virtio1:
rd_bytes: 9873920
wr_total_time_ns: 349343
timed_stats:
idle_time_ns: 356525089889476
rd_operations: 713
invalid_wr_operations: 0
flush_total_time_ns: 44479793
wr_merged: 0
rd_total_time_ns: 3579080449
invalid_flush_operations: 0
rd_merged: 374
invalid_rd_operations: 0
wr_operations: 3
wr_bytes: 8704
flush_operations: 3
failed_flush_operations: 0
failed_wr_operations: 0
wr_highest_offset: 101067001856
failed_rd_operations: 0
ide2:
failed_rd_operations: 0
failed_wr_operations: 0
wr_highest_offset: 0
failed_flush_operations: 0
flush_operations: 0
wr_bytes: 0
invalid_rd_operations: 0
wr_operations: 0
rd_merged: 0
rd_total_time_ns: 4260094
invalid_flush_operations: 0
wr_merged: 0
flush_total_time_ns: 0
invalid_wr_operations: 0
rd_operations: 31
idle_time_ns: 421709647094818
timed_stats:
wr_total_time_ns: 0
rd_bytes: 98528
virtio2:
failed_wr_operations: 0
wr_highest_offset: 101067001856
failed_flush_operations: 0
failed_rd_operations: 0
invalid_rd_operations: 0
wr_operations: 3
rd_merged: 527
flush_operations: 3
wr_bytes: 8704
rd_total_time_ns: 3875743145
invalid_flush_operations: 0
wr_merged: 0
flush_total_time_ns: 39725220
invalid_wr_operations: 0
idle_time_ns: 356524601426443
rd_operations: 1344
wr_total_time_ns: 332449
timed_stats:
rd_bytes: 12458496
scsi0:
wr_total_time_ns: 1643289257414
timed_stats:
rd_bytes: 5499833856
rd_operations: 425630
idle_time_ns: 3622188282
invalid_wr_operations: 0
flush_total_time_ns: 387816238867
wr_merged: 0
invalid_flush_operations: 0
rd_total_time_ns: 1572071115981
flush_operations: 20097
wr_bytes: 3004784640
invalid_rd_operations: 0
wr_operations: 83417
rd_merged: 0
failed_rd_operations: 0
failed_wr_operations: 0
wr_highest_offset: 192516268032
failed_flush_operations: 0
virtio0:
flush_total_time_ns: 29451853
invalid_wr_operations: 0
idle_time_ns: 356525306297910
rd_operations: 7134
wr_total_time_ns: 343660
timed_stats:
rd_bytes: 36174336
wr_highest_offset: 94624550912
failed_wr_operations: 0
failed_flush_operations: 0
failed_rd_operations: 0
invalid_rd_operations: 0
wr_operations: 3
rd_merged: 5906
flush_operations: 3
wr_bytes: 8704
invalid_flush_operations: 0
rd_total_time_ns: 40399984855
wr_merged: 0
cpus: 8
disk: 0
diskread: 279390828256
diskwrite: 152051128320
maxdisk: 193273528320
maxmem: 100663296000
mem: 98735997991
name: gpu-v2
netin: 2219256593
netout: 2719739754
nics:
tap103i0:
netin: 2219256593
netout: 2719739754
pid: 2441
qmpstatus: running
serial: 1
status: running
template:
uptime: 421783
vmid: 103
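
The counters above are easier to compare as average latencies per operation. The following is a minimal sketch, not part of the original post, that assumes the output has been saved as plain text in the flattened form pasted here. The helper names `parse_blockstat` and `avg_latency_ms` are made up for illustration, and the list of device name prefixes is an assumption based on the devices visible in this thread.

```python
import re

# Device section headers as they appear in the pasted output, e.g. "virtio3:" or "ide0:".
# The set of prefixes is an assumption based on the devices seen in this thread.
DEVICE_RE = re.compile(r"^(virtio|scsi|ide|sata)\d+:$")

# Prefixes of per-device counters; top-level keys such as "cpus" are skipped.
COUNTER_PREFIXES = ("rd_", "wr_", "flush_", "failed_", "invalid_", "idle_")


def parse_blockstat(text):
    """Return {device: {counter: int}} from saved `qm status ID --verbose` text."""
    stats = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if DEVICE_RE.match(line):
            current = line[:-1]  # drop the trailing colon
            stats[current] = {}
            continue
        key, sep, value = line.partition(":")
        value = value.strip()
        if current and sep and key.startswith(COUNTER_PREFIXES) and value.isdigit():
            stats[current][key] = int(value)
    return stats


def avg_latency_ms(counters, kind):
    """Average time per operation in milliseconds; kind is "rd", "wr" or "flush"."""
    ops = counters.get(f"{kind}_operations", 0)
    total_ns = counters.get(f"{kind}_total_time_ns", 0)
    return total_ns / ops / 1e6 if ops else 0.0


def report(text):
    """Build one summary line per device with average rd/wr/flush latency."""
    lines = []
    for dev, counters in sorted(parse_blockstat(text).items()):
        parts = [f"{kind}={avg_latency_ms(counters, kind):.2f}ms"
                 for kind in ("rd", "wr", "flush")]
        lines.append(f"{dev}: " + " ".join(parts))
    return lines
```

Applied to the figures for VM 100, `ide0` works out to roughly 207 ms per flush (1093989818067 ns over 5283 flushes). These are cumulative averages since the VM started, so on their own they do not show what happens during a freeze.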
 
Was that status taken during a frozen state?
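
Since the freeze only shows up after some time, one way to have output from the frozen state is to save the status periodically. The following is a minimal sketch under assumptions: the function names, the file naming scheme, and the 30-second timeout are illustrative choices, and only the `qm status ID --verbose` command itself comes from this thread.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def snapshot_name(vmid, when):
    """File name for one snapshot (naming scheme is an arbitrary choice)."""
    return f"qm-status-{vmid}-{when:%Y%m%d-%H%M%S}.txt"


def capture(vmid, out_dir, run=subprocess.run):
    """Run `qm status <vmid> --verbose` and save its output to a timestamped file."""
    result = run(
        ["qm", "status", str(vmid), "--verbose"],
        capture_output=True,
        text=True,
        timeout=30,  # assumption: give up if qm itself does not answer
    )
    path = Path(out_dir) / snapshot_name(vmid, datetime.now())
    path.write_text(result.stdout)
    return path


def watch(vmids, out_dir, interval_s=60, rounds=None, run=subprocess.run):
    """Capture every VM each interval; rounds=None keeps going until interrupted."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    done = 0
    while rounds is None or done < rounds:
        for vmid in vmids:
            try:
                capture(vmid, out_dir, run=run)
            except subprocess.TimeoutExpired:
                # A qm call that times out is itself worth recording.
                with open(Path(out_dir) / "timeouts.log", "a") as log:
                    log.write(f"{datetime.now().isoformat()} vmid={vmid} timed out\n")
        done += 1
        time.sleep(interval_s)
```

Calling `watch([100, 103], "/root/qm-snapshots")` on the host would keep one file per VM per minute, so the snapshots from just before and during the next freeze can be compared. The output directory here is only an example path.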
 
Hi,

No, sorry, I just ran it when you asked.

Would it be better to manage ZFS via Proxmox rather than FreeNAS?

David