[SOLVED] Specific VMID guest stops responding from host input, even after full reinstallation of PVE and VM re-creation-- please help!

lightheat

New Member
May 12, 2021

This is a very bizarre issue that I simply cannot explain, and I need help knowing where to look in the logs for clues as to what's happening. Here's what I see:

I'm hosting Proxmox baremetal on a PowerEdge R620 with a 20-core Xeon and 64GB RAM. When I had version 5.4-1 installed, I originally had 3 VMs set up: 100, 101, 102. All three of them were Xubuntu 20.04 images with between 1 and 4 cores and 3 to 32 GB of RAM. All 3 ran fine for a long time.

Shortly after I renamed 102 from "lab-video" to "lab-docker" (to re-purpose it), it began to "freeze" if I left it idle for more than a few minutes. I could still connect to the UI via the VNC Console, and the cursor would change appropriately to reflect what was under it (bar over text, arrows over window borders, etc), but I could not select or click anything. It just wouldn't respond to any keyboard or mouse input (beyond moving the cursor). Any process I had running on it, like a Handbrake encode, would continue to run unabated-- I just could no longer interact with the VM. So long as I kept actively using the VM via the Console, it would continue responding. It would only "freeze" if I left it unattended.

I got sick of the "freezing," so I decided to reinstall the OS on the guest. Just a clean wipe and reinstall. Still froze. Weird. The other 2 VMs were still acting fine. So I figured there might be something wrong with the RAM. I ran a full battery with Memtest86 (PassMark)-- no errors. OK, guess there's a bug in this version of Proxmox. Been meaning to upgrade anyway. Finally, I backed up 100, 101, and my /etc directory, and performed a clean-install upgrade of Proxmox on the server to 6.4-5.

I restored my storage.cfg file, restored 100 and 101 from backups, and created a brand spanking new VM for 102. I performed a clean install of Xubuntu 20.04.2, installed a handful of packages (Docker and its prerequisites as listed on the site), annnnnd it's doing the same thing again! And only on VMID 102! Pretty sure the number is cursed.
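For anyone curious about the mechanics of the migration, the backup/restore boiled down to roughly the following (the storage names are the real ones from my config, but the archive filenames and backup paths are placeholders from memory):

Code:
# On the old 5.4 host, before the wipe:
vzdump 100 101 --storage nas-pve --mode stop
cp /etc/pve/storage.cfg /mnt/somewhere-safe/

# On the fresh 6.4 host, after putting storage.cfg back in place:
qmrestore /mnt/pve/nas-pve/dump/vzdump-qemu-100-<timestamp>.vma 100 --storage ssdraid1vg
qmrestore /mnt/pve/nas-pve/dump/vzdump-qemu-101-<timestamp>.vma 101 --storage ssdraid1vg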

To add to the mystery, I just tried performing a host reboot and then a reset of the VM, and both produced an error.

The reboot failed with:
Code:
Error: VM quit/powerdown failed - got timeout

And the reset failed with:
Code:
Error: can't lock file '/var/lock/qemu-server/lock-102.conf' - got timeout

I'm going out of my mind. I can't even stop the damned thing now. What is happening?
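In case it helps anyone else hitting the same lock timeout: as far as I understand it, the usual escape hatch is something like the following (root only; I'm listing it here mostly so someone can tell me if it's a bad idea):

Code:
qm unlock 102             # clear any leftover config-level lock
qm stop 102 --skiplock 1  # hard stop, skipping the lock check
# If qm itself hangs because a previous task still holds the lock file,
# the last resort is to find and kill the KVM process for that VMID:
ps aux | grep '[k]vm -id 102'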

Here is some qm config output. Please let me know what else I can provide to help you help me. Thanks in advance.

Code:
Last login: Wed May 12 21:41:16 EDT 2021 on pts/0
Linux pve 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64

root@pve:~# qm config 100
balloon: 768
boot: dcn
bootdisk: scsi0
cores: 1
ide2: none,media=cdrom
memory: 3072
name: lab-monitor
net0: virtio=E6:E7:9A:8A:64:C8,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: ssdraid1vg:vm-100-disk-1,size=20G
scsihw: virtio-scsi-pci
smbios1: uuid=6034898f-447a-42e2-bec8-6953912290ab
sockets: 1
vga: vmware
vmgenid: 73cefd3e-9adc-4c8a-94a4-c2c01ad392a2
root@pve:~# qm config 101
balloon: 1024
bootdisk: scsi0
cores: 2
ide2: nas-pve:iso/xubuntu-18.04.3-desktop-amd64.iso,media=cdrom,size=1467360K
memory: 8192
name: lab-backup
net0: virtio=D2:B8:CE:00:A6:AA,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: ssdraid1vg:vm-101-disk-1,size=20G
scsihw: virtio-scsi-pci
smbios1: uuid=1947c77d-251c-4596-bac0-a397f3489da4
sockets: 1
vga: vmware
vmgenid: 82d6e387-699f-4009-85b1-e0ef8b548d59
root@pve:~# qm config 102
balloon: 8192
boot: order=scsi0;ide2;net0
cores: 8
ide2: none,media=cdrom
memory: 32768
name: lab-docker
net0: virtio=26:18:AE:7B:89:C4,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: ssdraid1vg:vm-102-disk-1,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=d99e8da6-7117-4c12-ba60-9f475eded444
sockets: 1
vga: vmware
vmgenid: 783045d9-548f-4a42-8b4a-5de3a0e38f66
root@pve:~#
 
OK, a bit more investigating has revealed at least one pattern. The VMs are in fact NOT all the same version of Xubuntu. I created another VM, 103, and installed Xubuntu 20.04.2 to it, same as 102. Just like 102, it stopped responding to all input in the novnc console after about a minute of remaining idle. VMs 100 and 101 are running Xubuntu 18.04, and work just fine. So there seems to be some disconnect between vncproxy/novnc and Xubuntu 20.04.

I'm debating whether to install a different VNC server directly on the guests themselves and just bypass the Proxmox Console (I'd rather not; a rough sketch of what that would look like is below); downgrade both the malfunctioning VMs to 18.04, which is still supported until April 2023; or continue to try to fix novnc in Proxmox. I'd still prefer to use the in-browser novnc console.
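For reference, the in-guest workaround would look something like this (untested on my end; x11vnc is just one of several servers that would do the job):

Code:
# inside the Xubuntu 20.04 guest
sudo apt update
sudo apt install x11vnc
# share the existing :0 session; -forever keeps serving after a client disconnects
x11vnc -display :0 -auth guess -usepw -forever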

It does not seem to be browser specific. I've seen reports of issues with Touch API in Chromium browsers, but fiddling with those settings doesn't seem to make a difference. I've noticed that after I open the Console in 3 different browsers (Brave, FF, IE11) and move the mouse around the VM screen, the cursor moves in synchrony across all three. VNC does seem to be able to capture the mouse cursor-- it just doesn't seem to be able to send that movement to the guest.

There are still no errors reported in the PVE log (at least the one in the bottom pane of the UI) when novnc stops responding. Where else should I look? Would love any tips for what I can try. If I discover anything else, I'll be sure to update here. Thanks.
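For completeness, here is where I've been looking on the host so far; if there's a better place to catch vncproxy/novnc errors, please point me at it:

Code:
journalctl -b -u pveproxy -u pvedaemon   # the API/console side of things
tail -f /var/log/syslog                  # general host log while reproducing the freeze
qm status 102 --verbose                  # confirm the guest itself is still up while novnc is "frozen"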
 
Final update:

Switching the two offending VMs, 102 and 103, back to the default display driver fixed the unresponsive novnc issue. It turned out to be the lock screen. For whatever reason, when the VM is configured to use the vmware driver, novnc continues to render the desktop that is "under" the lock screen, rather than the lock screen itself. That is why I was able to see the desktop and the cursor across multiple instances of novnc, but not able to interact with anything. With the default driver, I was able to unlock the desktop and use it as normal. I also disabled the "Use tablet for pointer" option on all 4 VMs for good measure, per advice received elsewhere.
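For the record, the per-VM change boils down to something like this on the CLI (the same can be done from the GUI under Hardware and Options; it takes effect on the next full VM start):

Code:
qm set 102 --delete vga   # drop the 'vga: vmware' override and fall back to the default display
qm set 102 --tablet 0     # turn off "Use tablet for pointer"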

For those wondering, I originally switched to the vmware-compatible display driver due to an issue I had 3 years ago where the default std driver failed to load the 18.04 desktop environment altogether-- I would get just a blank screen with a flashing cursor. That was still the case today when I switched 100 and 101 back to the default. However, performing a release upgrade of each of those to 20.04 solved the problem there as well, and now I can use the default display driver in all 4 VMs without issue. Just wish I had that option 3 years ago. :P
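The release upgrade itself was just the stock Ubuntu path, roughly:

Code:
# inside each 18.04 guest
sudo apt update && sudo apt full-upgrade
sudo do-release-upgrade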