Windows VMs stop on RDP or local console

FelixJ

New Member
Mar 1, 2019
Hello everybody,
I have a serious problem with some Windows VMs (Server 2016 and 2012 R2): they stop without any apparent reason and without leaving a trace in any log (neither /var/log/syslog nor, after a VM reboot, the Windows Event Viewer).
The Server 2016 guest runs as an MS AD DC; the 2012 R2 guest runs as a Remote Desktop server.
It happens reproducibly whenever a user (admin or normal user alike) connects interactively to the Windows guest.
It makes no difference whether the connection is initiated via RDP or directly via the web console.
Usually I connect to the DC only as admin, either via RDP or the web console, whereas the users connect to the Remote Desktop server via RDP, while I as admin again use both RDP and the web console.

To rule out damaged hardware, I added a second node, created a cluster, and migrated one of the Windows guests to the other node, which also rules out possible KSM issues.
I have also set up a brand-new Windows guest and joined it to the domain. Same issue.

All three VMs run a commercial AV engine.

There's also an Ubuntu guest running, which seems not to be affected by this...

Now the Host-Side:
I am running an up-to-date two-node PVE cluster (see pveversion -v below) on the no-subscription channel.
The underlying hardware is in both cases an HP ProLiant DL360 (8x Intel(R) Xeon(R) CPU E5-2450L 0 @ 1.80GHz, 32 GB ECC DDR3, up-to-date BIOS, RAID 5 on an HP Smart Array P420).

The most annoying thing about this is that there is no trace of a log entry, no evidence on the hypervisor side, of why the VMs were terminated.

This spooky behavior started around December 2018, and I am constantly upgrading via the no-subscription channel to be sure I am up to date.

dmesg -T only shows the virtual interface being removed from / re-added to the vmbr, first when the VM went offline and then when I manually restarted it:
[Fri Mar 1 09:14:37 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:14:37 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:49:32 2019] device tap103i0 entered promiscuous mode
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered blocking state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered blocking state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered forwarding state

/var/log/syslog only shows that the VM is gone (failed to run vncproxy):
Mar 1 09:49:12 remote2 qm[9303]: VM 103 qmp command failed - VM 103 not running
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: HUP conn (1837-9303-32) (ipcs.c:759:qb_ipcs_dispatch_connection_request)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: qb_ipcs_disconnect(1837-9303-32) state:2 (ipcs.c:606:qb_ipcs_disconnect)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: Connection to pid:9303 destroyed (server.c:147:s1_connection_closed_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: connection about to be freed (server.c:132:s1_connection_destroyed_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-response-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-event-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pvedaemon[9301]: Failed to run vncproxy.
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-request-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process msg:4, size:5460 (server.c:168:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process result 0 (server.c:318:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pvedaemon[7109]: <root@pam> end task UPID:remote2:00002455:03E3F31C:5C78F206:vncproxy:103:root@pam: Failed to run vncproxy.
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process msg:7, size:134 (server.c:168:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process result 0 (server.c:318:s1_msg_process_fn)


pveversion -v:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-46
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-38
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-2
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-46
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

Here are the VM Config files:

cat /etc/pve/qemu-server/100.conf (this is the Server 2016 guest, which acts as DC and is affected when I connect via the web UI or RDP):
agent: 1
bootdisk: virtio0
cores: 2
cpu: host
ide0: none,media=cdrom
memory: 4096
name: dc-zstahl-01
net0: virtio=E6:27:DF:04:4D:2C,bridge=vmbr1
numa: 0
onboot: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=5490fb8e-33b2-4555-ad0e-d244b6660bf6
sockets: 1
startup: order=1
vga: qxl
virtio0: local-lvm:vm-100-disk-0,format=raw,size=60G
virtio1: local-lvm:vm-100-disk-1,format=raw,size=500G

cat /etc/pve/qemu-server/101.conf
agent: 1
args: -device intel-hda,id=sound5,bus=pci.0,addr=0x18 -device hda-micro,id=sound5-codec0,bus=sound5.0,cad=0 -device hda-duplex,id=sound5-codec1,bus=sound5.0,cad=1
bootdisk: virtio0
cores: 2
hotplug: disk,usb
ide0: none,media=cdrom
ide2: none,media=cdrom
memory: 15360
name: ts-zstahl-01
net0: virtio=9E:92:DE:0A:4E:CE,bridge=vmbr1
numa: 0
onboot: 1
ostype: win8
scsihw: virtio-scsi-pci
smbios1: uuid=170fccd6-fe7f-42af-a6fc-506a2248f493
sockets: 3
startup: order=2,up=10
vga: qxl
virtio0: local-lvm:vm-101-disk-0,format=raw,size=100G

cat /etc/pve/qemu-server/103.conf (this is the brand-new Server 2012 R2 guest which I set up for debugging and which is also affected):

bootdisk: virtio0
cores: 4
ide2: local:iso/SW_DVD9_Windows_Svr_Std_and_DataCtr_2012_R2_64Bit_German_-4_MLF_X19-82895.ISO,media=cdrom
ide3: local:iso/virtio-win.iso,media=cdrom,size=367806K
memory: 15360
name: ts-zstahl-02
net0: virtio=C2:9B:D4:56:6E:0F,bridge=vmbr1
numa: 0
ostype: win8
parent: BasisInstallation
scsihw: virtio-scsi-pci
smbios1: uuid=c7449310-bdc8-4264-9710-1a508ec27d8c
sockets: 1
virtio0: local-lvm:vm-103-disk-0,size=100G
vmgenid: 4b1b8d9e-2647-4602-9167-ebab7ed19f78


cat /etc/pve/qemu-server/102.conf (this is the Ubuntu VM, which is not affected, neither when I connect to the web UI console nor when I connect via SSH!)
agent: 1
bootdisk: virtio0
cores: 1
ide2: none,media=cdrom
memory: 512
name: vpn-zstahl-01
net0: virtio=9A:54:F7:C0:00:17,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=be026919-714a-4f79-87a1-c2ba66276b95
sockets: 1
startup: order=3
vga: qxl
virtio0: local-lvm:vm-102-disk-0,format=raw,size=10G

I hope I have provided all the information necessary to help resolve this problem!
Thank you in advance for your help!
Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
did you already check the log from inside the vm?
are you sure that the vm stops the moment you connect, and not some time before that (e.g. standby/suspend)?
 

FelixJ

New Member
Mar 1, 2019
did you already check the log from inside the vm?
are you sure that the vm stops the moment you connect, and not some time before that (e.g. standby/suspend)?
Yes, I did. The event log (System) only shows "Kernel-Power" errors stating that the computer was previously shut down unexpectedly.
And yes, it is reproducible.
Every time it happens, it happens when you log in either via RDP or the web UI console.
What I don't understand is why there are no traces of killed processes under Linux... it's as if someone had run kill -9 on the KVM process that runs the VM.
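A silent termination like that could in principle also come from the kernel's OOM killer, which kills processes without the guest noticing anything. A quick check to rule that out (a minimal sketch; standard kernel log locations assumed):

```shell
# Search the kernel ring buffer for OOM-killer activity that could
# explain a silently killed kvm process
dmesg -T | grep -iE 'out of memory|oom|killed process'

# On systemd hosts the same kernel messages are also kept in the journal
journalctl -k | grep -iE 'out of memory|oom|killed process'
```

If neither command turns up anything around the crash time, the OOM killer can be excluded.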
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
What I don't understand is, why aren't there any traces of killed processes under Linux... it's like if someone would make kill -9 to the kvm process that's runs the VM.
can you post the complete syslog from when such a vm is started until it stops working?
if you have current packages, there should be a 'qm cleanup' log entry when the vm gets stopped, like this:
Feb 28 14:38:59 hostname qmeventd[1699]: Starting cleanup for 102
Feb 28 14:38:59 hostname qmeventd[1699]: Finished cleanup for 102
 

FelixJ

New Member
Mar 1, 2019
can you post the complete syslog from when such vm is started until it does not work anymore?
if you have current packages, there should be a 'qm cleanup' log entry when the vm gets stopped like this:
This is all I've got from when my new test VM went down:
root@pmx2:~# grep qmeventd /var/log/syslog
Mar 1 09:14:18 pmx2 qmeventd[1119]: Starting cleanup for 103
Mar 1 09:14:18 pmx2 qmeventd[1119]: Cannot find device "tap103i0"
Mar 1 09:14:18 pmx2 qmeventd[1119]: can't unenslave 'tap103i0

So from what I can see, the "Finished cleanup" entry is missing... odd...
regards,
Felix
 

FelixJ

New Member
Mar 1, 2019
Good evening dcsapak!
Do you, or anyone else reading this, have any clue about this issue?
What more information could I provide to resolve this mystery?
Thank you for any help!
regards,
Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
can you post the complete syslog from when such vm is started until it does not work anymore?
you did not provide the syslog; what could also help is to start the vm in the foreground on the command line and see if it logs any errors

you can get the command line with 'qm showcmd ID --pretty'
remove the '-daemonize' part of the command line, then it starts in the foreground
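the steps above as a small shell sketch (the file names are only examples):

```shell
# dump the full KVM command line for VM 101 into a script
qm showcmd 101 --pretty > /root/start101-foreground.sh

# drop the "-daemonize \" line so the process stays in the foreground
sed -i '/-daemonize/d' /root/start101-foreground.sh

# run it (e.g. inside screen/tmux) and capture anything qemu prints
bash /root/start101-foreground.sh 2>&1 | tee /root/vm101-foreground.log
```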
 

FelixJ

New Member
Mar 1, 2019
you did not provide the syslog, what could also help is to start the vm in the foreground on the commandline and see if it logs some errors

you can get the commandline with 'qm showcmd ID --pretty'
and remove the 'daemonized' part of the commandline, then it starts in the foreground
Here's the output:
root@remote:~# qm showcmd 101 --pretty
/usr/bin/kvm \
-id 101 \
-name ts-zstahl-01 \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/101.pid \
-daemonize \
-smbios 'type=1,uuid=170fccd6-fe7f-42af-a6fc-506a2248f493' \
-smp '6,sockets=3,cores=2,maxcpus=6' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc unix:/var/run/qemu-server/101.vnc,x509,password \
-no-hpet \
-cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,enforce' \
-m 15360 \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'qxl-vga,id=vga,bus=pci.0,addr=0x2' \
-chardev 'socket,path=/var/run/qemu-server/101.qga,server,nowait,id=qga0' \
-device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
-device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
-spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' \
-device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' \
-chardev 'spicevmc,id=vdagent,name=vdagent' \
-device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:836fee027b5' \
-drive 'if=none,id=drive-ide0,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=201' \
-drive 'file=/dev/pve/vm-101-disk-0,if=none,id=drive-virtio0,format=raw,cache=none,aio=native,detect-zeroes=on' \
-device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=9E:92:DE:0A:4E:CE,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
-rtc 'driftfix=slew,base=localtime' \
-machine 'type=pc' \
-global 'kvm-pit.lost_tick_policy=discard' \
-device 'intel-hda,id=sound5,bus=pci.0,addr=0x18' \
-device 'hda-micro,id=sound5-codec0,bus=sound5.0,cad=0' \
-device 'hda-duplex,id=sound5-codec1,bus=sound5.0,cad=1'

Regarding syslog: I thought you were focused on the qmeventd messages, as you mentioned them.
I have now restarted the VM within a GNU screen session without the -daemonize option. However, there is no feedback in the output, no "Starting VM 101" message or similar...
I have also read the kvm man page, and it seems there are some logging options. However, I don't know how to use them to get meaningful output...
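From the man page, the options in question are presumably -d (which log items to record) and -D (the log file); appended to the end of the foreground command line they would look something like this (the log path is just an example):

```
-d guest_errors \
-D /root/vm101-qemu.log
```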

BTW: are the start parameters "normal", i.e. as they should be?
thanks,
Felix
 

FelixJ

New Member
Mar 1, 2019
Here's what happens when it runs in the foreground and crashes:
kvm: /home/builder/source/qemu.tmp/exec.c:1252: cpu_physical_memory_snapshot_get_dirty: Assertion `start + length <= snap->end' failed.
start101inforeground.sh: line 43: 25985 Aborted
/usr/bin/kvm -id 101 -name ts-zstahl-01 -chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/101.pid -smbios 'type=1,uuid=170fccd6-fe7f-42af-a6fc-506a2248f493' -smp '6,sockets=3,cores=2,maxcpus=6' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/101.vnc,x509,password -no-hpet -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,enforce' -m 15360 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'qxl-vga,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/101.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:836fee027b5' -drive 'if=none,id=drive-ide0,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=201' -drive 'file=/dev/pve/vm-101-disk-0,if=none,id=drive-virtio0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 
'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=9E:92:DE:0A:4E:CE,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -machine 'type=pc' -global 'kvm-pit.lost_tick_policy=discard' -device 'intel-hda,id=sound5,bus=pci.0,addr=0x18' -device 'hda-micro,id=sound5-codec0,bus=sound5.0,cad=0' -device 'hda-duplex,id=sound5-codec1,bus=sound5.0,cad=1'

Any ideas?
Regarding the syslog you requested: how can I attach a file? It's more than 700 lines long.
Thanks, Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016

FelixJ

New Member
Mar 1, 2019
I have applied the patch. Before the patch, the system would crash approximately every two days, though it also ran up to 11 days without incident, depending on how many logins there were.
So I have to wait a while (up to two weeks of usage) before it's safe to say that this patch fixed the problem.

Meanwhile:
As far as I can tell from the error message and the bug report, this issue has something to do with the virtual VGA graphics adapter. Is that right?
In the article linked from the bug report you mentioned, the recommended mitigation is to switch to the cirrus graphics driver. Would that help as well?
regards,
Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
So I have to wait a while (up to 2 weeks of usage) until it's safe to say, that this patch fixed the problem.
great, i will wait for the result

Would this help as well?
could be, but note that cirrus has had many security issues in the past (this is the reason we changed the default and removed it from the GUI)

you can still set it:
Code:
qm set VMID -vga cirrus
 

Whatever

Member
Nov 19, 2012
Dominik, would using VirtioGPU be a solution, and are there any benefits to using it?
Thanks in advance
 

FelixJ

New Member
Mar 1, 2019
any news yet?
Hi, good morning,
sorry I couldn't get back to you any sooner.
Yes, so far there have not been any further issues, and the system has undergone an extreme phase of logins/logoffs as well as local console administration, to ensure stability on the admin side as well.
In my opinion your patch solved the problem, and you could release it to production environments.

regards,
Felix
 
