[SOLVED] Failed to run vncproxy in all VMs / Windows VMs no longer booting

audioPhil

New Member
Jul 29, 2020
Hi all,

we are running Proxmox on a single server (no cluster), and the VMs are stored locally on LVM storage.
Currently the system is running ~10-12 VMs. Most of them are Linux server systems, but we are also running two Windows machines.
During everyday use, all VMs were running fine and no issues occurred (SSH, HTTP, SMB, RDP were all behaving as expected).

Today, my colleague tried to boot up a Windows VM which had been offline for some weeks. He realized he could not get an RDP connection and so he tried to access the machine via the Proxmox noVNC console.
However, this was not working either. The Console tab shows "Verbinden..." (I suspect the English translation would be "Connecting...") and after a while the following error appears:
Code:
VM 114 qmp command 'change' failed - unable to connect to VM 114 qmp socket - timeout after 600 retries
TASK ERROR: Failed to run vncproxy.
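A quick sanity check here (assuming the default Proxmox runtime paths, with 114 as the example VM ID) is to see whether the kvm process behind the VM is still alive and whether its QMP socket exists, roughly:
Code:
# PID of the kvm process for VM 114, if any
cat /var/run/qemu-server/114.pid
# is that process still running?
ps -p "$(cat /var/run/qemu-server/114.pid)"
# does the QMP socket exist?
ls -l /var/run/qemu-server/114.qmp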

This is when I got involved in the issue (I am sort of the main administrator of the Proxmox instance).

So far I found out / tried the following:
- For almost all of the VMs the noVNC console is not working.
- For some non-critical systems I tried to initiate a shutdown via the Web GUI, as well as via qm shutdown <vmid>. The systems with a non-working noVNC console cannot be shut down this way; they show VM quit/powerdown failed. (A hard-stop alternative is sketched after this list.)
- I logged into some of the Linux VMs via SSH and initiated a shutdown. I then restarted these VMs via the Web GUI and they are now working as expected (noVNC is showing, I can now initiate a shutdown too).
- I tried the same for one of the (up to that point still working) Windows VMs. It does not come back online and is accessible neither via RDP nor via noVNC. While starting the Windows VM via the Web GUI appears to succeed, starting it via qm start <vmid> shows
Code:
start failed: command '/usr/bin/kvm -id 118 -name Office-36 -chardev 'socket,id=qmp,path=/var/run/qemu-server/118.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/118.pid -daemonize -smbios 'type=1,uuid=0bb73145-4d1e-451a-9114-e681e89baeea' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/118.vnc,password -no-hpet -cpu 'kvm64,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep' -m 3072 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=c026deef-a4ff-4772-8028-487a4e00c961' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:4aa886956b8e' -drive 'file=/dev/pve/vm-118-disk-0,if=none,id=drive-ide0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -netdev 'type=tap,id=net0,ifname=tap118i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=82:93:FB:78:4C:09,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -machine 'type=pc+pve0' -global 'kvm-pit.lost_tick_policy=discard'' failed: got timeout
- The problem also occurs if I try to create a new Windows VM from scratch. Not even the initial boot into the installer ISO file is working.
- The problem does not occur if I try to create a new Linux VM from scratch. The installer loads perfectly as expected.
- I installed the latest packages (apt update && apt upgrade) but nothing changed.
- I did not yet restart the hypervisor itself. One of the Windows VMs is still running and I would prefer not to lose access to it too, as it is in daily use.
- I searched for the issue on the Proxmox forum and found this thread, which describes a similar behaviour regarding the "Failed to run vncproxy" part. However, the issue described there was resolved with an upgrade to pve-qemu-kvm 4.0.0-7, and we are already running pve-qemu-kvm/stable 5.0.0-11 amd64.
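For the non-critical systems above where qm shutdown hangs, the hard-stop counterpart would be qm stop; if even that fails because QMP is unresponsive, the kvm process can be killed via its PID file as a last resort (a rough sketch with 114 as an example ID; a hard stop is like pulling the plug, so only for VMs where that is acceptable):
Code:
# hard stop instead of a clean shutdown
qm stop 114
# last resort: kill the QEMU process directly via its PID file
kill $(cat /var/run/qemu-server/114.pid)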


What could have led to this error?
A week ago I updated the packages on the Proxmox instance. After that, I did not reboot the hypervisor. As we are not using the noVNC feature on a daily basis, it could very well be that one of the updates introduced the problem and it remained undetected until today.
However, I have one Linux VM with an uptime of 6 days (rebooted right after the update of Proxmox) which is working perfectly, while another Linux VM with an uptime of 4 days (rebooted multiple days after the update) is not working.

Likely not related, but something I changed recently:
I introduced named admin accounts for my colleagues and me and deactivated the default root account. This had some unexpected side effects, such as updates no longer being possible via the Web GUI; therefore, I updated via the CLI using apt. To rule this out as a possible cause, I reactivated the root account today. The issues persist.
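For completeness, disabling and re-enabling the built-in root@pam account can be done with the standard pveum CLI, roughly like this (a sketch; this is not necessarily the exact command that was used):
Code:
# re-enable the built-in root account (set --enable 0 to disable it again)
pveum user modify root@pam --enable 1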

Information on our system:
Code:
root@aphrodite:/var/log# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.0.21-1-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

qm config for one of the Windows VMs which are not working (removing the ISO file does not change anything):
Code:
root@aphrodite:/var/log# qm config 114
bootdisk: ide0
cores: 2
ide0: local-lvm:vm-114-disk-0,size=45G
ide2: local:iso/SW_DVD5_Office_Professional_Plus_2019_32_BIT_X64_German_C2R_X21-84630.ISO,media=cdrom,size=3431288K
memory: 3072
name: REDACTED
net0: e1000=02:49:BE:86:7B:F0,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
parent: REDACTED
scsihw: virtio-scsi-pci
smbios1: uuid=c5639d23-a61e-4fa7-bcd2-6e6c7c546fe3
sockets: 1
vmgenid: 200b9229-0fb8-4357-a4f9-664522104366

I am out of ideas and hope that someone might be able to help. If required, I could organize a reboot of the hypervisor relatively quickly. A short downtime of the VMs is no problem, but I am afraid that we will then be without any working Windows VM.

Regards,
Phil
 
Code:
VM 114 qmp command 'change' failed - unable to connect to VM 114 qmp socket - timeout after 600 retries

This indicates some kind of issue talking to the QEMU process in general, which is not VNC-specific. The regular shutdown failing also points in that direction. What happens if you open up the 'monitor' interface in the GUI and type 'help'? Is there anything peculiar in the system logs when you start a VM that then does not work?
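Roughly the same checks can also be done from the shell (114 as the example VM ID; qm monitor talks to the same socket as the GUI 'Monitor' tab, and journalctl shows the kernel/system log around the start attempt):
Code:
# CLI equivalent of the GUI 'Monitor' tab
qm monitor 114
# then type: help   (or: info status)

# kernel and system messages around the failed start
journalctl -k --since "1 hour ago"
journalctl -u pvedaemon --since "1 hour ago"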

You are running a very old kernel, so rebooting is likely in order anyway.
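A rough way to see the mismatch without rebooting (the running kernel versus what is installed for the next boot):
Code:
uname -r                                              # kernel currently running (5.0.21-1-pve here)
pveversion -v | grep -E 'running kernel|pve-kernel'   # installed pve-kernel packages
ls /boot/vmlinuz-*                                    # kernel images available for the next boot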
 
The 'monitor' interface shows a similar error message:
Code:
Type 'help' for help.
# help
ERROR: VM 114 qmp command 'human-monitor-command' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries

The system log shows lots of kernel messages, in particular a "kernel NULL pointer dereference" and an associated call trace.

System log snippet:
Code:
Jul 30 11:58:26 aphrodite systemd[1]: Started 114.scope.
Jul 30 11:58:26 aphrodite systemd-udevd[3972]: Using default interface naming scheme 'v240'.
Jul 30 11:58:26 aphrodite systemd-udevd[3972]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 30 11:58:26 aphrodite systemd-udevd[3972]: Could not generate persistent MAC address for tap114i0: No such file or directory
Jul 30 11:58:27 aphrodite kernel: [19416042.179524] device tap114i0 entered promiscuous mode
Jul 30 11:58:27 aphrodite systemd-udevd[3972]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 30 11:58:27 aphrodite systemd-udevd[3972]: Could not generate persistent MAC address for fwbr114i0: No such file or directory
Jul 30 11:58:27 aphrodite systemd-udevd[3972]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 30 11:58:27 aphrodite systemd-udevd[3972]: Could not generate persistent MAC address for fwpr114p0: No such file or directory
Jul 30 11:58:27 aphrodite systemd-udevd[3974]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 30 11:58:27 aphrodite systemd-udevd[3974]: Using default interface naming scheme 'v240'.
Jul 30 11:58:27 aphrodite systemd-udevd[3974]: Could not generate persistent MAC address for fwln114i0: No such file or directory
Jul 30 11:58:27 aphrodite kernel: [19416042.217613] fwbr114i0: port 1(fwln114i0) entered blocking state
Jul 30 11:58:27 aphrodite kernel: [19416042.218243] fwbr114i0: port 1(fwln114i0) entered disabled state
Jul 30 11:58:27 aphrodite kernel: [19416042.218874] device fwln114i0 entered promiscuous mode
Jul 30 11:58:27 aphrodite kernel: [19416042.219525] fwbr114i0: port 1(fwln114i0) entered blocking state
Jul 30 11:58:27 aphrodite kernel: [19416042.220210] fwbr114i0: port 1(fwln114i0) entered forwarding state
Jul 30 11:58:27 aphrodite kernel: [19416042.224951] vmbr0: port 13(fwpr114p0) entered blocking state
Jul 30 11:58:27 aphrodite kernel: [19416042.225533] vmbr0: port 13(fwpr114p0) entered disabled state
Jul 30 11:58:27 aphrodite kernel: [19416042.226334] device fwpr114p0 entered promiscuous mode
Jul 30 11:58:27 aphrodite kernel: [19416042.226948] vmbr0: port 13(fwpr114p0) entered blocking state
Jul 30 11:58:27 aphrodite kernel: [19416042.227474] vmbr0: port 13(fwpr114p0) entered forwarding state
Jul 30 11:58:27 aphrodite kernel: [19416042.232115] fwbr114i0: port 2(tap114i0) entered blocking state
Jul 30 11:58:27 aphrodite kernel: [19416042.232640] fwbr114i0: port 2(tap114i0) entered disabled state
Jul 30 11:58:27 aphrodite kernel: [19416042.233215] fwbr114i0: port 2(tap114i0) entered blocking state
Jul 30 11:58:27 aphrodite kernel: [19416042.233800] fwbr114i0: port 2(tap114i0) entered forwarding state
Jul 30 11:58:27 aphrodite kernel: [19416042.269259] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Jul 30 11:58:27 aphrodite kernel: [19416042.269889] #PF error: [INSTR]
Jul 30 11:58:27 aphrodite kernel: [19416042.270503] PGD 0 P4D 0 
Jul 30 11:58:27 aphrodite kernel: [19416042.270996] Oops: 0010 [#9] SMP PTI
Jul 30 11:58:27 aphrodite kernel: [19416042.271595] CPU: 3 PID: 4025 Comm: kvm Tainted: P      D    O      5.0.21-1-pve #1
Jul 30 11:58:27 aphrodite kernel: [19416042.272114] Hardware name: MSI MS-7A70/B250M PRO-VDH (MS-7A70), BIOS A.10 12/05/2016
Jul 30 11:58:27 aphrodite kernel: [19416042.272709] RIP: 0010:          (null)
Jul 30 11:58:27 aphrodite kernel: [19416042.273282] Code: Bad RIP value.
[...]
Jul 30 11:58:27 aphrodite kernel: [19416042.280555] Call Trace:
Jul 30 11:58:27 aphrodite kernel: [19416042.281109]  kvm_vcpu_ioctl_get_hv_cpuid+0x44/0x220 [kvm]
Jul 30 11:58:27 aphrodite kernel: [19416042.281762]  ? vmx_vcpu_load+0x21f/0x550 [kvm_intel]
Jul 30 11:58:27 aphrodite kernel: [19416042.282344]  ? apparmor_file_alloc_security+0x42/0x190
Jul 30 11:58:27 aphrodite kernel: [19416042.282946]  ? get_page_from_freelist+0xf55/0x1440
Jul 30 11:58:27 aphrodite kernel: [19416042.283574]  ? kvm_arch_vcpu_load+0x94/0x250 [kvm]
Jul 30 11:58:27 aphrodite kernel: [19416042.284152]  ? vmx_vcpu_put+0x1a/0x20 [kvm_intel]
Jul 30 11:58:27 aphrodite kernel: [19416042.284782]  kvm_arch_vcpu_ioctl+0x14b/0x11f0 [kvm]
Jul 30 11:58:27 aphrodite kernel: [19416042.285362]  ? __alloc_pages_nodemask+0x13f/0x2e0
Jul 30 11:58:27 aphrodite kernel: [19416042.285947]  ? mem_cgroup_commit_charge+0x82/0x4d0
Jul 30 11:58:27 aphrodite kernel: [19416042.286572]  ? mem_cgroup_try_charge+0x8b/0x190
Jul 30 11:58:27 aphrodite kernel: [19416042.287124]  kvm_vcpu_ioctl+0xe5/0x610 [kvm]
Jul 30 11:58:27 aphrodite kernel: [19416042.287736]  do_vfs_ioctl+0xa9/0x640
Jul 30 11:58:27 aphrodite kernel: [19416042.288333]  ? handle_mm_fault+0xe1/0x210
Jul 30 11:58:27 aphrodite kernel: [19416042.288963]  ksys_ioctl+0x67/0x90
Jul 30 11:58:27 aphrodite kernel: [19416042.289576]  __x64_sys_ioctl+0x1a/0x20
Jul 30 11:58:27 aphrodite kernel: [19416042.290167]  do_syscall_64+0x5a/0x110
Jul 30 11:58:27 aphrodite kernel: [19416042.290877]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 30 11:58:27 aphrodite kernel: [19416042.291583] RIP: 0033:0x7f91e76a3427
[...]
Jul 30 11:58:27 aphrodite kernel: [19416042.297319] Modules linked in: ip6table_raw iptable_raw arc4 md4 cmac nls_utf8 cifs ccm fscache dm_snapshot tcp_diag inet_diag uas usb_storage veth ebtable_filter ebtables ip_set ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel zfs(PO) aesni_intel aes_x86_64 zunicode(PO) crypto_simd cryptd glue_helper intel_cstate zlua(PO) ppdev i915 kvmgt vfio_mdev intel_rapl_perf mdev snd_hda_intel input_leds intel_wmi_thunderbolt vfio_iommu_type1 vfio snd_hda_codec serio_raw kvm snd_hda_core irqbypass drm_kms_helper snd_hwdep snd_pcm drm i2c_algo_bit snd_timer fb_sys_fops pcspkr syscopyarea sysfillrect snd sysimgblt soundcore mei_me mei mxm_wmi parport_pc parport mac_hid acpi_pad zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm
Jul 30 11:58:27 aphrodite kernel: [19416042.297342]  ib_core sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbkbd usbhid hid psmouse mpt3sas raid_class scsi_transport_sas i2c_i801 ahci r8169 realtek libahci wmi video
Jul 30 11:58:27 aphrodite kernel: [19416042.305093] CR2: 0000000000000000
Jul 30 11:58:27 aphrodite kernel: [19416042.306013] ---[ end trace b5f0287ed1df42ca ]---
Jul 30 11:58:27 aphrodite kernel: [19416042.306844] RIP: 0010:          (null)
Jul 30 11:58:27 aphrodite kernel: [19416042.307706] Code: Bad RIP value.
[...]

So I guess a reboot would be the next logical step, as it seems to be a kernel-related problem?
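If we go that route, the plan would be roughly the following (assuming the guests react to an ACPI shutdown, with qm stop as the fallback for VMs that are already stuck):
Code:
# cleanly shut down every VM on the node, then reboot the hypervisor
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    qm shutdown "$vmid" --timeout 180 || qm stop "$vmid"
done
reboot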
 
Thanks for the support. Restarting the hypervisor seems to have resolved the issue: all VMs came back online properly and appear to work as expected.
I marked this thread as solved.
 
I have the same problem. It seems to be a PVE bug, and I am not sure whether the newest version has fixed it or not.

Simply shutting the VM down and then starting it again resolves the error, but for some important VMs we would still like to fix the issue without a shutdown and restart.
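For reference, the per-VM cycle would roughly be (114 is just an example ID; when QMP is unresponsive, qm shutdown itself can fail, so a shutdown from inside the guest or qm stop may be needed instead):
Code:
# restart cycle for a single affected VM: clean shutdown, hard stop as fallback, then start
qm shutdown 114 --timeout 300 || qm stop 114
qm start 114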
 
