Hello Everybody,
I have a serious problem with some Windows VMs (Server 2016 and 2012 R2): they keep stopping without any apparent reason and without leaving a trace in any log (neither in /var/log/syslog nor, after restarting the VM, in the Windows Event Viewer).
The Server 2016 machine runs as an MS AD domain controller; the 2012 R2 machine runs as a Remote Desktop server.
The problem is reproducible: it happens whenever a user (admin or normal user, it makes no difference) connects interactively to the Windows guest.
It also makes no difference whether the connection is initiated via RDP or directly through the web console.
I usually connect to the DC only as admin, either via RDP or the web console, whereas the users connect to the Remote Desktop server via RDP, and I as admin again use both RDP and the web console.
To rule out defective hardware, I added a second node, created a cluster, and migrated one of the Windows guests to the other node (also to rule out possible KSM issues).
I have also set up a new Windows guest and joined it to the domain. Same issue.
All three VMs run a commercial AV engine.
There is also an Ubuntu guest running, which does not seem to be affected by this...
Now for the host side:
I am running an up-to-date two-node PVE cluster on the no-subscription channel (see pveversion -v below).
The underlying hardware is in both cases an HP ProLiant DL360 (8x Intel(R) Xeon(R) CPU E5-2450L 0 @ 1.80GHz, 32 GB ECC DDR3, up-to-date BIOS, RAID 5 on an HP Smart Array P420).
The most annoying thing about this is that there is no log entry at all, no evidence on the hypervisor side, as to why the VMs were terminated.
This spooky behavior started around December 2018, and I upgrade constantly via the no-subscription channel to make sure I stay up to date.
dmesg -T only shows the virtual interfaces being removed from and re-added to the vmbr after the VM went offline and was manually restarted:
[Fri Mar 1 09:14:37 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:14:37 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:49:32 2019] device tap103i0 entered promiscuous mode
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered blocking state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered disabled state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered blocking state
[Fri Mar 1 09:49:32 2019] vmbr1: port 4(tap103i0) entered forwarding state
/var/log/syslog only shows that the VM is gone (failed to run vncproxy):
Mar 1 09:49:12 remote2 qm[9303]: VM 103 qmp command failed - VM 103 not running
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: HUP conn (1837-9303-32) (ipcs.c:759:qb_ipcs_dispatch_connection_request)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: qb_ipcs_disconnect(1837-9303-32) state:2 (ipcs.c:606:qb_ipcs_disconnect)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: Connection to pid:9303 destroyed (server.c:147:s1_connection_closed_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: connection about to be freed (server.c:132:s1_connection_destroyed_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-response-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-event-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pvedaemon[9301]: Failed to run vncproxy.
Mar 1 09:49:12 remote2 pmxcfs[1837]: [libqb] debug: Free'ing ringbuffer: /dev/shm/qb-pve2-request-1837-9303-32-header (ringbuffer_helper.c:337:qb_rb_close_helper)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process msg:4, size:5460 (server.c:168:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process result 0 (server.c:318:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pvedaemon[7109]: <root@pam> end task UPID:remote2:00002455:03E3F31C:5C78F206:vncproxy:103:root@pam: Failed to run vncproxy.
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process msg:7, size:134 (server.c:168:s1_msg_process_fn)
Mar 1 09:49:12 remote2 pmxcfs[1837]: [ipcs] debug: process result 0 (server.c:318:s1_msg_process_fn)
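In case it is useful, here are the checks I intend to run the next time a VM dies (just my own guesses, assuming the QEMU process is being killed from outside, e.g. by the OOM killer, rather than exiting cleanly; the times and VMID are only examples):
# look for OOM-killer or KVM-related kernel messages around the crash time
dmesg -T | grep -iE 'oom|out of memory|killed process|kvm'
# check the journal for anything logged in the crash window
journalctl -b --since "09:40" --until "09:55" | grep -iE 'qemu|kvm|103'
# ask PVE what it currently knows about the guest
qm status 103 --verbose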
pveversion -v:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-46
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-38
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-2
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-46
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
Here are the VM config files:
cat /etc/pve/qemu-server/100.conf (this is the Server 2016 machine acting as DC, which is affected when I connect via the web UI or RDP):
agent: 1
bootdisk: virtio0
cores: 2
cpu: host
ide0: none,media=cdrom
memory: 4096
name: dc-zstahl-01
net0: virtio=E6:27F:04:4D:2C,bridge=vmbr1
numa: 0
onboot: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=5490fb8e-33b2-4555-ad0e-d244b6660bf6
sockets: 1
startup: order=1
vga: qxl
virtio0: local-lvm:vm-100-disk-0,format=raw,size=60G
virtio1: local-lvm:vm-100-disk-1,format=raw,size=500G
cat /etc/pve/qemu-server/101.conf
agent: 1
args: -device intel-hda,id=sound5,bus=pci.0,addr=0x18 -device hda-micro,id=sound5-codec0,bus=sound5.0,cad=0 -device hda-duplex,id=sound5-codec1,bus=sound5.0,cad=1
bootdisk: virtio0
cores: 2
hotplug: disk,usb
ide0: none,media=cdrom
ide2: none,media=cdrom
memory: 15360
name: ts-zstahl-01
net0: virtio=9E:92E:0A:4E:CE,bridge=vmbr1
numa: 0
onboot: 1
ostype: win8
scsihw: virtio-scsi-pci
smbios1: uuid=170fccd6-fe7f-42af-a6fc-506a2248f493
sockets: 3
startup: order=2,up=10
vga: qxl
virtio0: local-lvm:vm-101-disk-0,format=raw,size=100G
cat /etc/pve/qemu-server/103.conf (this is the brand-new Server 2012 R2 VM that I set up for debugging, and which is also affected):
bootdisk: virtio0
cores: 4
ide2: local:iso/SW_DVD9_Windows_Svr_Std_and_DataCtr_2012_R2_64Bit_German_-4_MLF_X19-82895.ISO,media=cdrom
ide3: local:iso/virtio-win.iso,media=cdrom,size=367806K
memory: 15360
name: ts-zstahl-02
net0: virtio=C2:9B4:56:6E:0F,bridge=vmbr1
numa: 0
ostype: win8
parent: BasisInstallation
scsihw: virtio-scsi-pci
smbios1: uuid=c7449310-bdc8-4264-9710-1a508ec27d8c
sockets: 1
virtio0: local-lvm:vm-103-disk-0,size=100G
vmgenid: 4b1b8d9e-2647-4602-9167-ebab7ed19f78
cat /etc/pve/qemu-server/102.conf (this is the Ubuntu VM, which is not affected either when I connect via the web UI console or when I connect via SSH!):
agent: 1
bootdisk: virtio0
cores: 1
ide2: none,media=cdrom
memory: 512
name: vpn-zstahl-01
net0: virtio=9A:54:F7:C0:00:17,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=be026919-714a-4f79-87a1-c2ba66276b95
sockets: 1
startup: order=3
vga: qxl
virtio0: local-lvm:vm-102-disk-0,format=raw,size=10G
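In the meantime I am also thinking about leaving a trivial watcher running on the host, so that I at least get an exact timestamp the next time it happens (just an ad-hoc loop of mine, not a PVE feature; the log file path is only an example):
# record the exact moment VM 103 leaves the "running" state
while qm status 103 | grep -q running; do sleep 5; done
echo "VM 103 stopped at $(date)" >> /root/vm103-watch.log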
I hope I have provided all the information needed for someone to help me resolve this problem!
Thank you in advance for your help!
Felix