VM instance IO hangs

Davidrama

New Member
Oct 15, 2018
Hello,

I'm running Proxmox VE 5.2 with GPU passthrough (2x 1080) for ML and am facing some strange issues.
After a certain period of time, IO freezes on all my instances: I can still connect to them, but no jobs run anymore.
I have only 2 instances, one running FreeNAS and the other dedicated to machine learning (using passthrough).

I need to restart my VMs to get things back to normal.
Any ideas?

proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-26
pve-kernel-4.4.19-1-pve: 4.4.19-66
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
 
After a certain period of time all my instances get their IO frozen
What does your storage config look like? Is it full, maybe? Does 'qm status ID --verbose' say anything interesting? Are there any logs inside the VM?
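
For the "is it full?" question, a quick check on the host can be sketched as below. This is only a minimal sketch: it assumes the storage in question is a mounted filesystem, the path "/" is a placeholder, and the 95% threshold is an arbitrary choice. It will not report usage for raw LVM volumes or for ZFS pools that live inside a guest.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def nearly_full(path: str, threshold: float = 95.0) -> bool:
    """True if usage is at or above `threshold` percent (arbitrary default)."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    # "/" is a placeholder; replace it with the mount points of your storages.
    for mount in ["/"]:
        print(f"{mount}: {usage_percent(mount):.1f}% used")
```

The percentage can differ slightly from what `df` prints, because `df` accounts for blocks reserved for root.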
 
Hi,

Storage has 8 disks:
- 1 SSD for the GPU machine
- The rest are used by the FreeNAS VM as ZFS pools inside the VM, which is why they show as nearly full (from 98% to 100%)


Capture d’écran 2018-10-17 à 09.32.45.png



First VM (FreeNAS):
##########
root@prox4:~# qm status 100 --verbose
balloon: 33554432000
ballooninfo:
max_mem: 33554432000
actual: 33554432000
blockstat:
virtio5:
flush_operations: 740
failed_wr_operations: 0
timed_stats:
wr_highest_offset: 5798205652992
rd_total_time_ns: 407358124767
wr_merged: 1068
invalid_rd_operations: 0
failed_flush_operations: 0
rd_bytes: 1214873088
wr_operations: 221094
rd_merged: 1092
invalid_flush_operations: 0
wr_total_time_ns: 139040025286
invalid_wr_operations: 0
failed_rd_operations: 0
rd_operations: 28599
idle_time_ns: 106753367104209
flush_total_time_ns: 195889986931
wr_bytes: 2087962624
virtio2:
wr_highest_offset: 5798205652992
timed_stats:
flush_operations: 740
failed_wr_operations: 0
failed_flush_operations: 0
invalid_rd_operations: 0
rd_total_time_ns: 405002447458
wr_merged: 1079
rd_bytes: 1211065856
wr_operations: 219610
rd_merged: 1094
idle_time_ns: 109395010704041
failed_rd_operations: 0
rd_operations: 28581
wr_bytes: 2088062976
flush_total_time_ns: 197356784284
invalid_wr_operations: 0
invalid_flush_operations: 0
wr_total_time_ns: 145658953380
virtio1:
timed_stats:
wr_highest_offset: 5798205652992
flush_operations: 740
failed_wr_operations: 0
invalid_rd_operations: 0
failed_flush_operations: 0
rd_total_time_ns: 419883115952
wr_merged: 1077
rd_bytes: 1209669632
rd_merged: 1107
wr_operations: 219179
failed_rd_operations: 0
rd_operations: 28591
idle_time_ns: 150569857292777
flush_total_time_ns: 201091592253
wr_bytes: 2086914560
wr_total_time_ns: 159067651269
invalid_flush_operations: 0
invalid_wr_operations: 0
virtio4:
rd_merged: 1093
wr_operations: 219645
rd_bytes: 1217904640
flush_total_time_ns: 196624596430
wr_bytes: 2088122880
rd_operations: 28769
failed_rd_operations: 0
idle_time_ns: 109395220995365
wr_total_time_ns: 157425359160
invalid_flush_operations: 0
invalid_wr_operations: 0
timed_stats:
wr_highest_offset: 5798205652992
flush_operations: 740
failed_wr_operations: 0
invalid_rd_operations: 0
failed_flush_operations: 0
wr_merged: 1072
rd_total_time_ns: 416213022737
virtio3:
invalid_wr_operations: 0
invalid_flush_operations: 0
wr_total_time_ns: 139650704978
idle_time_ns: 106753363498497
failed_rd_operations: 0
rd_operations: 28737
wr_bytes: 2087859712
flush_total_time_ns: 197374774789
rd_bytes: 1218160640
wr_operations: 222130
rd_merged: 1057
rd_total_time_ns: 411224644333
wr_merged: 1074
failed_flush_operations: 0
invalid_rd_operations: 0
failed_wr_operations: 0
flush_operations: 740
wr_highest_offset: 5798205652992
timed_stats:
ide0:
failed_flush_operations: 0
invalid_rd_operations: 0
wr_merged: 0
rd_total_time_ns: 28631212465
wr_highest_offset: 107373940736
timed_stats:
failed_wr_operations: 0
flush_operations: 5283
wr_bytes: 3788649472
flush_total_time_ns: 1093989818067
idle_time_ns: 21115508450
rd_operations: 100656
failed_rd_operations: 0
invalid_wr_operations: 0
invalid_flush_operations: 0
wr_total_time_ns: 73854147055
wr_operations: 328294
rd_merged: 0
rd_bytes: 170229760
virtio6:
failed_wr_operations: 0
flush_operations: 0
wr_highest_offset: 0
timed_stats:
rd_total_time_ns: 22138970
wr_merged: 0
failed_flush_operations: 0
invalid_rd_operations: 0
rd_bytes: 1525760
wr_operations: 0
rd_merged: 0
invalid_wr_operations: 0
wr_total_time_ns: 0
invalid_flush_operations: 0
idle_time_ns: 156627569771034
rd_operations: 236
failed_rd_operations: 0
wr_bytes: 0
flush_total_time_ns: 0
cpus: 2
disk: 0
diskread: 6243429376
diskwrite: 14227572224
maxdisk: 107374182400
maxmem: 33554432000
mem: 31966096986
name: labnas
netin: 615260151
netout: 6359577217
nics:
tap100i0:
netout: 6359577217
netin: 615260151
pid: 25535
qmpstatus: running
status: running
template:
uptime: 156708
vmid: 100

######
Second VM (GPU)
######

root@prox4:~# qm status 103 --verbose
blockstat:
virtio3:
rd_bytes: 273832389120
timed_stats:
wr_total_time_ns: 27376269100931
rd_operations: 5123337
idle_time_ns: 6427725922587
invalid_wr_operations: 0
flush_total_time_ns: 33268640774
wr_merged: 193864
rd_total_time_ns: 2651625666858
invalid_flush_operations: 0
wr_bytes: 149046317568
flush_operations: 31231
rd_merged: 727963
invalid_rd_operations: 0
wr_operations: 1607039
failed_rd_operations: 0
failed_flush_operations: 0
failed_wr_operations: 0
wr_highest_offset: 2267905425408
virtio1:
rd_bytes: 9873920
wr_total_time_ns: 349343
timed_stats:
idle_time_ns: 356525089889476
rd_operations: 713
invalid_wr_operations: 0
flush_total_time_ns: 44479793
wr_merged: 0
rd_total_time_ns: 3579080449
invalid_flush_operations: 0
rd_merged: 374
invalid_rd_operations: 0
wr_operations: 3
wr_bytes: 8704
flush_operations: 3
failed_flush_operations: 0
failed_wr_operations: 0
wr_highest_offset: 101067001856
failed_rd_operations: 0
ide2:
failed_rd_operations: 0
failed_wr_operations: 0
wr_highest_offset: 0
failed_flush_operations: 0
flush_operations: 0
wr_bytes: 0
invalid_rd_operations: 0
wr_operations: 0
rd_merged: 0
rd_total_time_ns: 4260094
invalid_flush_operations: 0
wr_merged: 0
flush_total_time_ns: 0
invalid_wr_operations: 0
rd_operations: 31
idle_time_ns: 421709647094818
timed_stats:
wr_total_time_ns: 0
rd_bytes: 98528
virtio2:
failed_wr_operations: 0
wr_highest_offset: 101067001856
failed_flush_operations: 0
failed_rd_operations: 0
invalid_rd_operations: 0
wr_operations: 3
rd_merged: 527
flush_operations: 3
wr_bytes: 8704
rd_total_time_ns: 3875743145
invalid_flush_operations: 0
wr_merged: 0
flush_total_time_ns: 39725220
invalid_wr_operations: 0
idle_time_ns: 356524601426443
rd_operations: 1344
wr_total_time_ns: 332449
timed_stats:
rd_bytes: 12458496
scsi0:
wr_total_time_ns: 1643289257414
timed_stats:
rd_bytes: 5499833856
rd_operations: 425630
idle_time_ns: 3622188282
invalid_wr_operations: 0
flush_total_time_ns: 387816238867
wr_merged: 0
invalid_flush_operations: 0
rd_total_time_ns: 1572071115981
flush_operations: 20097
wr_bytes: 3004784640
invalid_rd_operations: 0
wr_operations: 83417
rd_merged: 0
failed_rd_operations: 0
failed_wr_operations: 0
wr_highest_offset: 192516268032
failed_flush_operations: 0
virtio0:
flush_total_time_ns: 29451853
invalid_wr_operations: 0
idle_time_ns: 356525306297910
rd_operations: 7134
wr_total_time_ns: 343660
timed_stats:
rd_bytes: 36174336
wr_highest_offset: 94624550912
failed_wr_operations: 0
failed_flush_operations: 0
failed_rd_operations: 0
invalid_rd_operations: 0
wr_operations: 3
rd_merged: 5906
flush_operations: 3
wr_bytes: 8704
invalid_flush_operations: 0
rd_total_time_ns: 40399984855
wr_merged: 0
cpus: 8
disk: 0
diskread: 279390828256
diskwrite: 152051128320
maxdisk: 193273528320
maxmem: 100663296000
mem: 98735997991
name: gpu-v2
netin: 2219256593
netout: 2719739754
nics:
tap103i0:
netin: 2219256593
netout: 2719739754
pid: 2441
qmpstatus: running
serial: 1
status: running
template:
uptime: 421783
vmid: 103
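
The blockstat counters above are easier to read as averages. The following is a rough sketch, not an official tool: it assumes the output has been pasted as flat text with the indentation lost (as in this post), recognises device names by a guessed pattern (virtioN, scsiN, ideN, sataN), and divides each `*_total_time_ns` by the matching `*_operations`. The function names are made up for this example.

```python
import re

# Device header lines such as "virtio3:" or "ide0:" (assumed naming pattern).
DEVICE_RE = re.compile(r"^(virtio|scsi|ide|sata)\d+:$")
# Per-device counters such as "flush_operations: 5283".
COUNTER_RE = re.compile(r"^((?:rd|wr|flush|idle|failed|invalid)_\w+):\s*(\d+)$")


def parse_blockstat(text):
    """Collect per-device counters from flattened 'qm status --verbose' text."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if DEVICE_RE.match(line):
            current = devices.setdefault(line[:-1], {})
            continue
        match = COUNTER_RE.match(line)
        if match and current is not None:
            current[match.group(1)] = int(match.group(2))
        elif line != "timed_stats:":
            # Any other key (cpus, maxmem, nics, ...) ends the device block.
            current = None
    return devices


def avg_latency_ms(stats, op):
    """Average time per operation in ms for op in {'rd', 'wr', 'flush'}."""
    ops = stats.get(f"{op}_operations", 0)
    if ops == 0:
        return None
    return stats.get(f"{op}_total_time_ns", 0) / ops / 1e6


def error_counters(stats):
    """Return the failed_*/invalid_* counters that are not zero."""
    return {k: v for k, v in stats.items()
            if k.startswith(("failed_", "invalid_")) and v > 0}


# A few ide0 lines copied from the VM 100 output above.
SAMPLE = """\
ide0:
failed_flush_operations: 0
rd_total_time_ns: 28631212465
timed_stats:
failed_wr_operations: 0
flush_operations: 5283
flush_total_time_ns: 1093989818067
rd_operations: 100656
failed_rd_operations: 0
wr_total_time_ns: 73854147055
wr_operations: 328294
cpus: 2
"""

if __name__ == "__main__":
    for name, stats in parse_blockstat(SAMPLE).items():
        for op in ("rd", "wr", "flush"):
            print(name, op, avg_latency_ms(stats, op))
        print(name, "errors:", error_counters(stats))
```

For ide0 on VM 100 this works out to roughly 207 ms per flush, against well under 1 ms per read or write. These are averages over the whole uptime, so they cannot show a freeze by themselves; they only hint at where time is being spent.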
 
Is that the status during a frozen state?
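
One way to answer this is to take two snapshots of `qm status <vmid> --verbose` a short while apart during the hang and check whether the operation counters still move. The sketch below assumes the two snapshots have already been turned into dictionaries of the form {device: {counter: value}}; the function name and the numbers in the example are hypothetical, not taken from this thread.

```python
# Counters that should grow whenever the guest completes IO on a device.
OP_COUNTERS = ("rd_operations", "wr_operations", "flush_operations")


def stalled_devices(before, after):
    """Return the devices whose operation counters did not change."""
    stalled = []
    for device, old in before.items():
        new = after.get(device)
        if new is None:
            # Device missing from the second snapshot: nothing to compare.
            continue
        if all(new.get(key, 0) == old.get(key, 0) for key in OP_COUNTERS):
            stalled.append(device)
    return sorted(stalled)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    first = {
        "scsi0": {"rd_operations": 100, "wr_operations": 50, "flush_operations": 5},
        "virtio3": {"rd_operations": 900, "wr_operations": 400, "flush_operations": 9},
    }
    second = {
        "scsi0": {"rd_operations": 100, "wr_operations": 50, "flush_operations": 5},
        "virtio3": {"rd_operations": 950, "wr_operations": 420, "flush_operations": 9},
    }
    print(stalled_devices(first, second))
```

Counters that stay flat are only a hint: an idle guest also completes no IO, so this is worth checking while a job that should be reading or writing is supposedly running.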
 
Hi,

No, I just ran it when you asked, sorry.

Would it be better to manage ZFS via Proxmox rather than FreeNAS?

David
 
