Windows VMs stop on RDP or local console

FelixJ

New Member
Mar 1, 2019
Hello everybody,
I have a serious problem with some Windows VMs (Server 2016 and 2012 R2): they stop without any apparent reason and without leaving a trace in any log (neither /var/log/syslog nor, after a VM reboot, the Windows Event Viewer).
The Server 2016 guest runs as an MS AD DC; the 2012 R2 guest runs as a Remote Desktop server.
It happens reproducibly whenever a user (admin or normal user alike) connects interactively to the Windows guest.
It makes no difference whether the connection is initiated via RDP or directly via the web console.
Usually I connect to the DC only as admin, either via RDP or the web console, whereas the users connect to the Remote Desktop server via RDP, while I as admin again use both RDP and the web console.

To rule out damaged hardware, I added a second node, created a cluster, and migrated one of the Windows guests to the other node, which also rules out possible KSM issues.
I have also set up a brand-new Windows guest and joined it to the domain. Same issue.

All three VMs run a commercial AV engine.

There's also an Ubuntu guest running, which seems not to be affected by this...

Now the Host-Side:
I am running an up-to-date two-node PVE cluster (see pveversion -v below) on the no-subscription channel.
The underlying hardware is in both cases an HP ProLiant DL360 (8x Intel(R) Xeon(R) CPU E5-2450L 0 @ 1.80GHz, 32 GB ECC DDR3, up-to-date BIOS, RAID 5 on an HP Smart Array P420).

The most annoying thing about this is that there is no trace of a log entry, no evidence on the hypervisor side, of why the VMs were terminated.

This spooky behavior started around December 2018, and I am constantly upgrading via the no-subscription channel to be sure I am up to date.

dmesg -T only shows the virtual interface being removed from / re-added to the vmbr, first when the VM went offline and then when I manually restarted it:
[Fri Mar 1 09:14:37 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:14:37 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:49:32 2019] device tap103i0 entered promiscuous mode
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered blocking state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered blocking state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered forwarding state

/var/log/syslog only shows that the VM is gone (failed to run vncproxy):
Mar 1 09:49:12 remote2 qm[9303]: VM 103 qmp command failed - VM 103 not running
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: HUP conn (1837-9303-32) (ipcs.c:759:qb_ipcs_dispatch_connection_request)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: qb_ipcs_disconnect(1837-9303-32) state:2 (ipcs.c:606:qb_ipcs_disconnect)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: Connection to pid:9303 destroyed (server.c:147:s1_connection_closed_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: connection about to be freed (server.c:132:s1_connection_destroyed_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-response-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-event-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pvedaemon[9301]: Failed to run vncproxy.
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-request-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process msg:4, size:5460 (server.c:168:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process result 0 (server.c:318:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pvedaemon[7109]: <root@pam> end task UPID:remote2:00002455:03E3F31C:5C78F206:vncproxy:103:root@pam: Failed to run vncproxy.
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process msg:7, size:134 (server.c:168:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process result 0 (server.c:318:s1_msg_process_fn)


pveversion -v:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-46
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-38
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-2
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-46
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

Here are the VM Config files:

cat /etc/pve/qemu-server/100.conf (this is the Server 2016 guest, which acts as DC and is affected when I connect via the web UI or RDP):
agent: 1
bootdisk: virtio0
cores: 2
cpu: host
ide0: none,media=cdrom
memory: 4096
name: dc-zstahl-01
net0: virtio=E6:27:DF:04:4D:2C,bridge=vmbr1
numa: 0
onboot: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=5490fb8e-33b2-4555-ad0e-d244b6660bf6
sockets: 1
startup: order=1
vga: qxl
virtio0: local-lvm:vm-100-disk-0,format=raw,size=60G
virtio1: local-lvm:vm-100-disk-1,format=raw,size=500G

cat /etc/pve/qemu-server/101.conf
agent: 1
args: -device intel-hda,id=sound5,bus=pci.0,addr=0x18 -device hda-micro,id=sound5-codec0,bus=sound5.0,cad=0 -device hda-duplex,id=sound5-codec1,bus=sound5.0,cad=1
bootdisk: virtio0
cores: 2
hotplug: disk,usb
ide0: none,media=cdrom
ide2: none,media=cdrom
memory: 15360
name: ts-zstahl-01
net0: virtio=9E:92:DE:0A:4E:CE,bridge=vmbr1
numa: 0
onboot: 1
ostype: win8
scsihw: virtio-scsi-pci
smbios1: uuid=170fccd6-fe7f-42af-a6fc-506a2248f493
sockets: 3
startup: order=2,up=10
vga: qxl
virtio0: local-lvm:vm-101-disk-0,format=raw,size=100G

cat /etc/pve/qemu-server/103.conf (this is the brand-new Server 2012 R2 guest which I set up for debugging and which is also affected):

bootdisk: virtio0
cores: 4
ide2: local:iso/SW_DVD9_Windows_Svr_Std_and_DataCtr_2012_R2_64Bit_German_-4_MLF_X19-82895.ISO,media=cdrom
ide3: local:iso/virtio-win.iso,media=cdrom,size=367806K
memory: 15360
name: ts-zstahl-02
net0: virtio=C2:9B:D4:56:6E:0F,bridge=vmbr1
numa: 0
ostype: win8
parent: BasisInstallation
scsihw: virtio-scsi-pci
smbios1: uuid=c7449310-bdc8-4264-9710-1a508ec27d8c
sockets: 1
virtio0: local-lvm:vm-103-disk-0,size=100G
vmgenid: 4b1b8d9e-2647-4602-9167-ebab7ed19f78


cat /etc/pve/qemu-server/102.conf (this is the Ubuntu VM, which is not affected, neither when I connect to the web UI console nor when I connect via SSH!)
agent: 1
bootdisk: virtio0
cores: 1
ide2: none,media=cdrom
memory: 512
name: vpn-zstahl-01
net0: virtio=9A:54:F7:C0:00:17,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=be026919-714a-4f79-87a1-c2ba66276b95
sockets: 1
startup: order=3
vga: qxl
virtio0: local-lvm:vm-102-disk-0,format=raw,size=10G

I hope I have provided all the information necessary to help resolve this problem!
Thank you in advance for your help!
Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
did you already check the log from inside the vm?
are you sure that the vm stops the moment you connect, and not some time before that (e.g. standby/suspend)?
 

FelixJ

New Member
Mar 1, 2019
did you already check the log from inside the vm?
are you sure that the vm stops the moment you connect, and not some time before that (e.g. standby/suspend)?
Yes, I did. The event log (System) only shows "Kernel-Power" errors stating that the computer was previously shut down unexpectedly.
And yes, it is reproducible.
Every time it happens, it happens when you log in either via RDP or the web UI console.
What I don't understand is why there are no traces of killed processes under Linux... it's as if someone had run kill -9 on the KVM process that runs the VM.
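A silent termination like that could in principle also come from the kernel's OOM killer, which kills processes without the guest noticing anything. A quick check to rule that out (a minimal sketch; standard kernel log locations assumed):

```shell
# Search the kernel ring buffer for OOM-killer activity that could
# explain a silently killed kvm process
dmesg -T | grep -iE 'out of memory|oom|killed process'

# On systemd hosts the same kernel messages are also kept in the journal
journalctl -k | grep -iE 'out of memory|oom|killed process'
```

If neither command turns up anything around the crash time, the OOM killer can be excluded.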
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
What I don't understand is, why aren't there any traces of killed processes under Linux... it's like if someone would make kill -9 to the kvm process that's runs the VM.
can you post the complete syslog from when such a vm is started until it stops working?
if you have current packages, there should be a 'qm cleanup' log entry when the vm gets stopped, like this:
Feb 28 14:38:59 hostname qmeventd[1699]: Starting cleanup for 102
Feb 28 14:38:59 hostname qmeventd[1699]: Finished cleanup for 102
 

FelixJ

New Member
Mar 1, 2019
can you post the complete syslog from when such vm is started until it does not work anymore?
if you have current packages, there should be a 'qm cleanup' log entry when the vm gets stopped like this:
This is all I've got from when my new test VM went down:
root@pmx2:~# grep qmeventd /var/log/syslog
Mar 1 09:14:18 pmx2 qmeventd[1119]: Starting cleanup for 103
Mar 1 09:14:18 pmx2 qmeventd[1119]: Cannot find device "tap103i0"
Mar 1 09:14:18 pmx2 qmeventd[1119]: can't unenslave 'tap103i0

So from what I can see, the "Finished cleanup" entry is missing... odd...
regards,
Felix
 

FelixJ

New Member
Mar 1, 2019
Good evening dcsapak!
Do you, or anyone else reading this, have any clue about this issue?
What more information could I provide to resolve this mystery?
Thank you for any help!
regards,
Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
can you post the complete syslog from when such vm is started until it does not work anymore?
you did not provide the syslog; what could also help is to start the vm in the foreground on the command line and see if it logs any errors

you can get the command line with 'qm showcmd ID --pretty'
remove the '-daemonize' part of the command line, then it starts in the foreground
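the steps above as a small shell sketch (the file names are only examples):

```shell
# dump the full KVM command line for VM 101 into a script
qm showcmd 101 --pretty > /root/start101-foreground.sh

# drop the "-daemonize \" line so the process stays in the foreground
sed -i '/-daemonize/d' /root/start101-foreground.sh

# run it (e.g. inside screen/tmux) and capture anything qemu prints
bash /root/start101-foreground.sh 2>&1 | tee /root/vm101-foreground.log
```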
 

FelixJ

New Member
Mar 1, 2019
you did not provide the syslog, what could also help is to start the vm in the foreground on the commandline and see if it logs some errors

you can get the commandline with 'qm showcmd ID --pretty'
and remove the 'daemonized' part of the commandline, then it starts in the foreground
Here's the output:
root@remote:~# qm showcmd 101 --pretty
/usr/bin/kvm \
-id 101 \
-name ts-zstahl-01 \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/101.pid \
-daemonize \
-smbios 'type=1,uuid=170fccd6-fe7f-42af-a6fc-506a2248f493' \
-smp '6,sockets=3,cores=2,maxcpus=6' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc unix:/var/run/qemu-server/101.vnc,x509,password \
-no-hpet \
-cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,enforce' \
-m 15360 \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'qxl-vga,id=vga,bus=pci.0,addr=0x2' \
-chardev 'socket,path=/var/run/qemu-server/101.qga,server,nowait,id=qga0' \
-device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
-device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
-spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' \
-device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' \
-chardev 'spicevmc,id=vdagent,name=vdagent' \
-device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:836fee027b5' \
-drive 'if=none,id=drive-ide0,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=201' \
-drive 'file=/dev/pve/vm-101-disk-0,if=none,id=drive-virtio0,format=raw,cache=none,aio=native,detect-zeroes=on' \
-device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=9E:92:DE:0A:4E:CE,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
-rtc 'driftfix=slew,base=localtime' \
-machine 'type=pc' \
-global 'kvm-pit.lost_tick_policy=discard' \
-device 'intel-hda,id=sound5,bus=pci.0,addr=0x18' \
-device 'hda-micro,id=sound5-codec0,bus=sound5.0,cad=0' \
-device 'hda-duplex,id=sound5-codec1,bus=sound5.0,cad=1'

Regarding syslog: I thought you were focused on the qmeventd messages, as you mentioned them.
I have now restarted the VM within a GNU screen session without the -daemonize option. However, there is no feedback in the output, no "Starting VM 101" message or similar...
I have also read the kvm man page, and it seems there are some logging options. However, I don't know how to use them to get meaningful output...
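From the man page, the options in question are presumably -d (which log items to record) and -D (the log file); appended to the end of the foreground command line they would look something like this (the log path is just an example):

```
-d guest_errors \
-D /root/vm101-qemu.log
```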

BTW: are the start parameters "normal", i.e. as they should be?
thanks,
Felix
 

FelixJ

New Member
Mar 1, 2019
Here's what happens when it runs in the foreground and crashes:
kvm: /home/builder/source/qemu.tmp/exec.c:1252: cpu_physical_memory_snapshot_get_dirty: Assertion `start + length <= snap->end' failed.
start101inforeground.sh: line 43: 25985 Aborted
/usr/bin/kvm -id 101 -name ts-zstahl-01 -chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/101.pid -smbios 'type=1,uuid=170fccd6-fe7f-42af-a6fc-506a2248f493' -smp '6,sockets=3,cores=2,maxcpus=6' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/101.vnc,x509,password -no-hpet -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,enforce' -m 15360 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'qxl-vga,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/101.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:836fee027b5' -drive 'if=none,id=drive-ide0,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=201' -drive 'file=/dev/pve/vm-101-disk-0,if=none,id=drive-virtio0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 
'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=9E:92:DE:0A:4E:CE,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -machine 'type=pc' -global 'kvm-pit.lost_tick_policy=discard' -device 'intel-hda,id=sound5,bus=pci.0,addr=0x18' -device 'hda-micro,id=sound5-codec0,bus=sound5.0,cad=0' -device 'hda-duplex,id=sound5-codec1,bus=sound5.0,cad=1'

Any ideas?
Regarding the syslog you requested: how can I attach a file? It's more than 700 lines long.
Thanks, Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016

FelixJ

New Member
Mar 1, 2019
I have applied the patch. Before the patch, the system would crash approximately every two days, though it also ran up to 11 days without incident, depending on how many logins there were.
So I have to wait a while (up to two weeks of usage) before it's safe to say that this patch fixed the problem.

Meanwhile:
As far as I can tell from the error message and the bug report, this issue has something to do with the virtual VGA graphics adapter. Is that right?
In the article linked from the bug report you mentioned, the recommended mitigation is to switch to the cirrus graphics driver. Would that help as well?
regards,
Felix
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
So I have to wait a while (up to 2 weeks of usage) until it's safe to say, that this patch fixed the problem.
great, i will wait for the result

Would this help as well?
could be, but note that cirrus has had many security issues in the past (this is the reason we changed the default and removed it from the GUI)

you can still set it:
Code:
qm set VMID -vga cirrus
 

Whatever

Member
Nov 19, 2012
Dominik, would using VirtioGPU be a solution, and are there any benefits to using it?
Thanks in advance
 

FelixJ

New Member
Mar 1, 2019
any news yet?
Hi, good morning,
sorry I couldn't get back to you any sooner.
Yes, so far there have not been any further issues, and the system has undergone an extreme phase of logins/logoffs as well as local console administration, to ensure stability on the admin side as well.
In my opinion your patch solved the problem, and you could release it to production environments.

regards,
Felix
 
