VMs become unresponsive

Reddin · Jan 11, 2022

Good day.

I have some VMs with storage configured on NFS, and all of a sudden they became unresponsive. CPU stays at 25% (14% for some of them). Disk r/w drops to 0. Nothing in the logs. The VMs can only be stopped from the console with qm stop <VMID>. Where should I look?

PVE version 6.3-4

fiona · Jan 11, 2022

Hi,
I'd try upgrading to a newer version (Proxmox VE 6.3 is a year old). If you're lucky the new kernel/QEMU will help. If the issue happens again after upgrading, have a look at /var/log/syslog. Is the network stable?

Reddin · Jan 11, 2022

Yes, I suspected network problems too and looked into syslog, but there is nothing there. The network is rock stable.

Reddin · Jan 11, 2022

Jan 11 11:47:46 tony pvedaemon[27757]: VM 1151 qmp command failed - VM 1151 qmp command 'query-proxmox-support' failed - unable to connect to VM 1151 qmp socket - timeout after 31 retries

That is what shows up in syslog for one of the unresponsive VMs.
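
To see which VMs are affected and how often, the host log can be filtered for that message. This is only a minimal sketch: it assumes the message format is exactly the one in the pvedaemon line above, and the function name count_qmp_failures is made up for illustration.

```shell
# Count "qmp command failed" lines per VMID in a syslog-style file.
# Assumes the pvedaemon message format shown in the post above.
count_qmp_failures() {
    # $1: path to the log file (for example /var/log/syslog on the PVE host)
    grep 'qmp command failed' "$1" \
        | sed -n 's/.*: VM \([0-9][0-9]*\) qmp command failed.*/\1/p' \
        | sort -n | uniq -c \
        | awk '{ print "VM " $2 ": " $1 " failure(s)" }'
}
```

Usage would be something like count_qmp_failures /var/log/syslog, run on the host rather than inside a guest.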

Reddin · Jan 11, 2022

Upgraded to the latest version. Will keep an eye on those VMs.

kaja · Apr 12, 2023

Hi,

I was searching the forums for this exact VM behaviour, which I have been observing for a few days/weeks.

State ----------
Version: PVE 7.4-3
VM: about 4 running VMs (debian 11-stable and 12-testing)
CPU: 2 cores for each VM
Apps: some VMs have Docker installed, others just a few ordinary apps
QEMU guest agent: enabled for all VMs

Condition ----------
Every few days, one of the VMs (at random) maxes out one of its CPU cores (shown as 50% load of 2 CPUs) and becomes unresponsive over both SSH and the PVE console. It stays in this state until manually stopped ("shutdown" won't work).

It has happened to various VMs over the last month, Debian 11 as well as 12, with or without Docker installed inside the VM.
I upgraded PVE to 7.4-3 from whatever version was there before; it happened before the upgrade and still happens after it.

RAM usage of the VM dropped from roughly 600Mi to 400Mi at the time of the fault.

I haven't noticed any network issues either.

Logs ----------
I haven't found any log showing any kind of problem. There is no indication in /var/log/syslog or in the systemd logs (I checked a few daemons I explicitly work with).
The last messages logged in /var/log/syslog are something like:

Apr 10 22:45:09 dock-first systemd[1]: Reached target Sound Card.
Apr 10 23:00:30 dock-first systemd[1]: Starting Cleanup of Temporary Directories...
Apr 10 23:00:30 dock-first systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Apr 10 23:00:30 dock-first systemd[1]: Finished Cleanup of Temporary Directories.

-> and silence right after that, until the next boot

Possible context ----------
When stopping and starting such a VM again, my Bluetooth connection drops. The Bluetooth USB adapter is passed through (by device, not by port) to a different, unaffected VM. This might be an independent issue/bug though.

fiona · Apr 12, 2023

kaja said:
Hi,

I was searching the forums for this exact VM behaviour, which I have been observing for a few days/weeks.

State ----------
Version: PVE 7.4-3
VM: about 4 running VMs (debian 11-stable and 12-testing)
CPU: 2 cores for each VM
Apps: some VMs have Docker installed, others just a few ordinary apps
QEMU guest agent: enabled for all VMs

Can you share the output of pveversion -v and qm config <ID> with the IDs of the affected VMs? What host CPU do you have?

kaja said:
Condition ----------
Every few days, one of the VMs (at random) maxes out one of its CPU cores (shown as 50% load of 2 CPUs) and becomes unresponsive over both SSH and the PVE console. It stays in this state until manually stopped ("shutdown" won't work).

It has happened to various VMs over the last month, Debian 11 as well as 12, with or without Docker installed inside the VM.
I upgraded PVE to 7.4-3 from whatever version was there before; it happened before the upgrade and still happens after it.

RAM usage of the VM dropped from roughly 600Mi to 400Mi at the time of the fault.

I haven't noticed any network issues either.

When a VM gets stuck again, can you run strace -c -p $(cat /var/run/qemu-server/<ID>.pid) with the ID of the stuck VM? Press Ctrl+C after a few seconds to get the output.

kaja said:
Logs ----------
I haven't found any log showing any kind of problem. There is no indication in /var/log/syslog or in the systemd logs (I checked a few daemons I explicitly work with).
The last messages logged in /var/log/syslog are something like:

Did you also check the logs on the host?

kaja said:
Possible context ----------
When stopping and starting such a VM again, my Bluetooth connection drops. The Bluetooth USB adapter is passed through (by device, not by port) to a different, unaffected VM. This might be an independent issue/bug though.

Is this the only device you pass-through or are there others? Are the IOMMUs isolated enough for your use case: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_Isolation ?

kaja · Apr 13, 2023

fiona said:
Can you share the output of pveversion -v and qm config <ID> with the IDs of the affected VMs? What host CPU do you have?

When a VM gets stuck again, can you run strace -c -p $(cat /var/run/qemu-server/<ID>.pid) with the ID of the stuck VM? Press Ctrl+C after a few seconds to get the output.

Did you also check the logs on the host?

Is this the only device you pass-through or are there others? Are the IOMMUs isolated enough for your use case: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_Isolation ?

Sharing output ----------
pveversion -v

proxmox-ve: 7.4-1 (running kernel: 5.15.104-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.4-1
pve-kernel-5.15.104-1-pve: 5.15.104-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

qm config 103

agent: 1
balloon: 1024
boot: order=scsi0;ide2;net0
cores: 2
ide2: local:iso/debian-11.6.0-amd64-netinst.iso,media=cdrom,size=388M
memory: 2048
meta: creation-qemu=7.1.0,ctime=1676657118
name: dock-first
net0: virtio=46:5D:0D:F2:FB:FE,bridge=vmbr0,firewall=1,tag=50
numa: 0
onboot: 1
ostype: l26
parent: fifo_only_and_carlos_hudba
scsi0: local-zfs:vm-103-disk-0,iothread=1,size=10G
scsi1: local-zfs:vm-103-disk-1,backup=0,iothread=1,size=1G
scsihw: virtio-scsi-single
smbios1: uuid=b9742f71-7049-494c-9a05-1246536245b7
sockets: 1
usb0: host=0a5c:2101
usb1: host=0bda:8771
usb2: host=0d8c:0014
vmgenid: d4aeaf0b-6c4c-47db-a5e4-7c5443fad2a1

qm config 107

agent: 1
balloon: 1024
boot: order=scsi0;ide2;net0
cores: 2
description: scsi1%3A local-zfs%3Avm-107-disk-1,backup=0,iothread=1,size=1G%0Ascsi1%3A local-zfs%3Avm-107-disk-1,backup=0,iothread=1,size=1G%0Ascsi1%3A local-zfs%3Avm-107-disk-1,backup=0,iothread=1,size=1G
ide2: local:iso/debian-11.6.0-amd64-netinst.iso,media=cdrom,size=388M
memory: 2048
meta: creation-qemu=7.1.0,ctime=1676657118
name: audio-server
net0: virtio=72:C4:1D:3D:E4:16,bridge=vmbr0,firewall=1,tag=50
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-107-disk-0,iothread=1,size=10G
scsi1: local-zfs:vm-107-disk-1,backup=0,iothread=1,size=1G
scsihw: virtio-scsi-single
smbios1: uuid=50a2ccf2-7d3c-44ec-ae3e-0d4339b09423
sockets: 1
tablet: 0
usb0: host=0d8c:0014
usb1: host=0bda:8771
usb2: host=0a5c:2101
vmgenid: f57c0d43-93dc-4ace-8395-2800298e3fe8

qm config 108

agent: 1
balloon: 1024
boot: order=scsi0;ide2;net0
cores: 2
description: Bacha - je tu disk pro swap, kter%C3%BD se nez%C3%A1lohuje... a p%C5%99i restoru je pot%C5%99eba jej p%C5%99ipojit JE%C5%A0T%C4%9A P%C5%98EDT%C3%8DM, ne%C5%BE ma%C5%A1inu spust%C3%AD%C5%A1, jinak bys to musel zase ru%C4%8Dn%C4%9B nastavovat ten swap...
ide2: local:iso/debian-11.6.0-amd64-netinst.iso,media=cdrom,size=388M
memory: 4096
meta: creation-qemu=7.1.0,ctime=1676657118
name: dock-home
net0: virtio=1A:E9:F5:C9:0A:E8,bridge=vmbr0,firewall=1,tag=50
numa: 0
onboot: 1
ostype: l26
parent: po_snapcastu_a_verce
scsi0: local-zfs:vm-108-disk-0,iothread=1,size=16G
scsi1: local-zfs:vm-108-disk-1,backup=0,iothread=1,size=1G
scsihw: virtio-scsi-single
smbios1: uuid=119a3435-c2a6-47be-b0c8-bf8557b2b1f5
sockets: 1
vmgenid: b613dc23-ca3e-4da2-9a7c-39cc41a933a4

Host processor ----------
Intel Celeron Elkhart Lake J6412, 2GHz, max 2.6GHz, quad core, AES-NI

Host log -----------
I forgot to check the host, so here it is:

Some time around the failure (I don't know the exact time :-/ ) I found hundreds of repeated messages per second like:
Apr 11 00:27:57 hyper-home kernel: [ 6212.812606] usb 1-6: usbfs: process 3718 (CPU 1/KVM) did not claim interface 1 before use
Apr 11 00:27:57 hyper-home kernel: [ 6212.981158] usb 1-6: usbfs: process 3717 (CPU 0/KVM) did not claim interface 1 before use
Apr 11 00:27:57 hyper-home QEMU[3615]: kvm: libusb_set_interface_alt_setting: -99 [OTHER]

Don't know if it's the cause or the consequence though. Those two processes (3717 and 3718) alternate randomly in the log. The burst repeated once after 1 minute and then twice after 5 minutes. Somewhere among those messages I found:

Apr 11 00:34:46 hyper-home kernel: [ 6621.202944] perf: interrupt took too long (3131 > 3130), lowering kernel.perf_event_max_sample_rate to 63750
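
With hundreds of these lines per second, a summary per USB port and process is easier to read than the raw log. The following is a rough sketch, assuming the kernel message format shown above; summarise_usbfs_warnings is a hypothetical helper name, not an existing tool.

```shell
# Summarise "did not claim interface" kernel messages per USB port and PID.
# Assumes the usbfs message format quoted in the post above.
summarise_usbfs_warnings() {
    # $1: path to the host log file
    awk '/usbfs: process [0-9]+ .* did not claim interface/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "usb")     port = $(i + 1)   # e.g. "1-6:"
            if ($i == "process") pid  = $(i + 1)   # e.g. "3718"
        }
        sub(/:$/, "", port)                        # strip the trailing colon
        count[port " pid " pid]++
    }
    END {
        for (k in count) print count[k], "usb", k
    }' "$1" | sort -k3,3 -k5,5n
}
```

Each output line has the form "COUNT usb PORT pid PID". The PIDs are QEMU vCPU threads, so they only tell you which VM process touched the device, not why.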

Rest ----------
I will send the info when it happens again.
I have two USB devices plugged in: a Bluetooth adapter and an external sound card. They are both passed to a single VM (107).
About IOMMU: I believe I used the wrong term with "passthrough", since all I did was "Add HW -> usb device -> use usb vendor ID". I can see the devices with "lsusb" on both the host and VM 107, so do I understand correctly that this is not a "real passthrough" of the device?

But for what it's worth, here is "lspci" for usb - it "shares" functions with RAM, but as I said - I believe I did not make real passthrough?
00:14.0 USB controller: Intel Corporation Device 4b7d (rev 11)
00:14.2 RAM memory: Intel Corporation Device 4b7f (rev 11)

fiona · Apr 13, 2023

kaja said:
qm config 103

usb0: host=0a5c:2101
usb1: host=0bda:8771
usb2: host=0d8c:0014

kaja said:
qm config 107

usb0: host=0d8c:0014
usb1: host=0bda:8771
usb2: host=0a5c:2101

kaja said:
Some time around the failure (I don't know the exact time :-/ ) I found hundreds of repeated messages per second like:
Apr 11 00:27:57 hyper-home kernel: [ 6212.812606] usb 1-6: usbfs: process 3718 (CPU 1/KVM) did not claim interface 1 before use
Apr 11 00:27:57 hyper-home kernel: [ 6212.981158] usb 1-6: usbfs: process 3717 (CPU 0/KVM) did not claim interface 1 before use
Apr 11 00:27:57 hyper-home QEMU[3615]: kvm: libusb_set_interface_alt_setting: -99 [OTHER]

I guess you can't connect the same device to multiple VMs (and have them both running at the same time), just like you can't do that physically. At least I wouldn't be surprised at all if that is the issue here.
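
A quick way to spot this kind of overlap is to scan all VM configs for USB host IDs that appear more than once. The sketch below assumes the configs are files named <VMID>.conf in one directory (on PVE that is normally /etc/pve/qemu-server), that USB entries look like the usbN: host=VENDOR:PRODUCT lines quoted above, and that snapshot sections start with a [name] line and should be ignored. The function name find_shared_usb is made up.

```shell
# List USB host IDs that are assigned in more than one VM config.
# Only the current config is checked; [snapshot] sections are skipped.
find_shared_usb() {
    # $1: directory containing the <VMID>.conf files
    awk '
        FNR == 1 { in_snapshot = 0 }         # new file: back to current config
        /^\[/    { in_snapshot = 1 }         # snapshot sections start with [name]
        !in_snapshot && /^usb[0-9]+: / {
            n = split($2, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] ~ /^host=/) {
                id = substr(opts[i], 6)      # text after "host="
                vm = FILENAME
                sub(/.*\//, "", vm); sub(/\.conf$/, "", vm)
                sep = (id in seen) ? " " : ""
                vms[id] = vms[id] sep vm
                seen[id]++
            }
        }
        END { for (id in seen) if (seen[id] > 1) print id ": " vms[id] }
    ' "$1"/*.conf | sort
}
```

Called as find_shared_usb /etc/pve/qemu-server, it prints one line per shared device followed by the VMIDs that reference it. It does not check whether those VMs actually run at the same time.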

kaja · Apr 13, 2023

fiona said:
I guess you can't connect the same device to multiple VMs (and have them both running at the same time), just like you can't do that physically. At least I wouldn't be surprised at all if that is the issue here.

Oh, thank you, I didn't notice that... I will check/correct it later today. It is a little odd though: I may have added one of the devices to both VMs (my fault), but not the other two :-X ...
Anyway, I will start from here and see what happens next. Thank you again.

kaja · Apr 13, 2023

Hi,

I came back this afternoon to fix my USB -> multiple VMs mistake. When removing the devices from the undesired VM, I got an error like:

Code:

Parameter verification failed. (400)
usb1: hotplug problem - VM 103 qmp command 'device_del' failed - got timeout

which I guess may be normal behaviour (considering the faulty previous state). The USB devices were successfully removed from the VM though.

But at the same time I found a VM that had been stuck since noon: VMID 108, which has nothing to do with USB devices at all. I ran the command as advised:

Code:

strace -c -p $(cat /var/run/qemu-server/108.pid)
strace: Process 4187 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0      4743           read
  0.00    0.000000           0     18448           write
  0.00    0.000000           0        21           close
  0.00    0.000000           0        89           sendmsg
  0.00    0.000000           0      4514           recvmsg
  0.00    0.000000           0        21           getsockname
  0.00    0.000000           0        42           fcntl
  0.00    0.000000           0      4842           ppoll
  0.00    0.000000           0        21           accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0     32741           total

I also checked the host's /var/log/syslog, but there were no errors in the hours before or after the failure (except an every-ten-seconds error fetching datastores from PBS, which is not running at the moment. I assume it's just a generic "can't reach you on the network" error, but I am mentioning it to be sure).

I suppose the faulty USB assignment might be the cause anyway (even though the stuck VM had no USB device attached)...
Otherwise, I'll happily fetch more info if it happens again.

Best regards,
Kaja

fiona · Apr 14, 2023

kaja said:

But at the same time I found a VM that had been stuck since noon: VMID 108, which has nothing to do with USB devices at all. I ran the command as advised:

Code:

strace -c -p $(cat /var/run/qemu-server/108.pid)
strace: Process 4187 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0      4743           read
  0.00    0.000000           0     18448           write
  0.00    0.000000           0        21           close
  0.00    0.000000           0        89           sendmsg
  0.00    0.000000           0      4514           recvmsg
  0.00    0.000000           0        21           getsockname
  0.00    0.000000           0        42           fcntl
  0.00    0.000000           0      4842           ppoll
  0.00    0.000000           0        21           accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0     32741           total

Hmm, it's strange that the timings are missing, but the calls distribution doesn't really stand out as problematic to me.

kaja said:
I also checked the host's /var/log/syslog, but there were no errors in the hours before or after the failure (except an every-ten-seconds error fetching datastores from PBS, which is not running at the moment. I assume it's just a generic "can't reach you on the network" error, but I am mentioning it to be sure).

If there were no errors, it could also be an issue within the guest itself. How did you determine that the VM is stuck, i.e. console/ping/etc.? Is there anything interesting in the guest's logs?

Yes, the PBS errors should not be related. You could disable the storage during the times it's not online to avoid the errors.

kaja said:
But for what it's worth, here is "lspci" for usb - it "shares" functions with RAM, but as I said - I believe I did not make real passthrough?
00:14.0 USB controller: Intel Corporation Device 4b7d (rev 11)
00:14.2 RAM memory: Intel Corporation Device 4b7f (rev 11)

Can you share the output of find /sys/kernel/iommu_groups/ -type l?

EDIT: @dcsapak told me that it should not be an issue here, USB passthrough uses a different mechanism.

kaja · Apr 14, 2023

fiona said:
If there were no errors, it could also be an issue within the guest itself. How did you determine that the VM is stuck, i.e. console/ping/etc.? Is there anything interesting in the guest's logs?

I suppose it is possible that I am doing something systematically wrong in all VMs, but I don't feel I am doing anything extraordinary... I usually just set up msmtp, unattended upgrades and an SSH key... after that some of them get Docker / a USB device / something like snapcast/pipewire... and that's it.

Being stuck: PVE shows a stable 50% CPU (I suppose one of the two cores is maxed out until a manual stop), and I can't SSH, ping, or get in through the PVE console of the VM. The PVE console did show some repeating message though (I just couldn't type any command), which I will try to screenshot next time :-/

The guest syslog itself had no error this time either. Just the hourly cron report, silence, and then boot messages after the manual reboot. Maybe there is some handy systemd daemon that I could check, but I have no idea.

I assume the "edit" about passthrough cancels the need for that output, but anyway:

Code:

/sys/kernel/iommu_groups/17/devices/0000:06:00.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.1
/sys/kernel/iommu_groups/15/devices/0000:03:00.0
/sys/kernel/iommu_groups/5/devices/0000:00:17.0
/sys/kernel/iommu_groups/13/devices/0000:01:00.0
/sys/kernel/iommu_groups/3/devices/0000:00:14.2
/sys/kernel/iommu_groups/3/devices/0000:00:14.0
/sys/kernel/iommu_groups/11/devices/0000:00:1c.6
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/8/devices/0000:00:1c.2
/sys/kernel/iommu_groups/16/devices/0000:04:00.0
/sys/kernel/iommu_groups/6/devices/0000:00:1c.0
/sys/kernel/iommu_groups/14/devices/0000:02:00.0
/sys/kernel/iommu_groups/4/devices/0000:00:16.0
/sys/kernel/iommu_groups/12/devices/0000:00:1f.0
/sys/kernel/iommu_groups/12/devices/0000:00:1f.5
/sys/kernel/iommu_groups/12/devices/0000:00:1f.3
/sys/kernel/iommu_groups/12/devices/0000:00:1f.4
/sys/kernel/iommu_groups/2/devices/0000:00:08.0
/sys/kernel/iommu_groups/10/devices/0000:00:1c.4
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/9/devices/0000:00:1c.3
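
The raw find output is easier to read when the devices are collected per group. A small sketch, assuming input paths of the form /sys/kernel/iommu_groups/<group>/devices/<address> as listed above; group_iommu_devices is a hypothetical helper name.

```shell
# Read "find /sys/kernel/iommu_groups/ -type l" output on stdin and print
# one line per IOMMU group, sorted by group number.
group_iommu_devices() {
    # With "/" as separator: field 5 is the group, field 7 the PCI address.
    awk -F/ '{ groups[$5] = groups[$5] " " $7 }
             END { for (g in groups) print "group " g ":" groups[g] }' \
        | sort -k2,2n
}
```

It would be used as find /sys/kernel/iommu_groups/ -type l | group_iommu_devices. For the listing above this puts 0000:00:14.0 and 0000:00:14.2 together in group 3, matching the lspci output posted earlier.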

kaja · May 4, 2023

I was searching for other solutions the other day and found this:
https://forum.proxmox.com/threads/vm-freezes-irregularly.111494/page-33#post-552822
which seems to be the issue. Also, the solution there seems to work for me so far.
TL;DR: Install the "intel-microcode" package, version 3.20230214.1, on the Proxmox host (it is already in the Debian bullseye contrib/non-free repository at that very version). I suppose it is an issue with some generation of Intel processors or something...
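
To confirm that the new microcode is actually loaded after installing the package and rebooting, the revision reported by the kernel can be compared before and after. This is a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo has a "microcode" field; the function name microcode_revisions is made up, and the apt line assumes the non-free component is already enabled in the host's APT sources.

```shell
# Print the distinct microcode revisions reported in a cpuinfo-style file.
microcode_revisions() {
    # $1: file to read (defaults to /proc/cpuinfo)
    awk -F': *' '/^microcode[ \t]*:/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}

# Possible workflow on the host (not run here):
#   microcode_revisions            # note the revision before the update
#   apt install intel-microcode    # assumes non-free is enabled in APT sources
#   reboot
#   microcode_revisions            # the revision should now differ
```

If the revision is unchanged after the reboot, the package probably did not contain a newer microcode for that CPU or it was not applied at boot.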
