We are in the process of migrating our last ESXi-based host to Proxmox. The problem is that the Proxmox host is crashing and rebooting. There are no errors in the log, and the system is at a remote server location (not hosted; our own hardware and location). We had been running this system in a lab environment, seemingly without issue. It was originally installed with Proxmox 3.x and upgraded to v4.x.
Current output of pveversion -v is:
proxmox-ve: 4.1-33 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-33
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-30
qemu-server: 4.0-46
pve-firmware: 1.1-7
libpve-common-perl: 4.0-43
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-21
pve-container: 1.0-37
pve-firewall: 2.0-15
pve-ha-manager: 1.0-18
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
apt-get update && apt-get dist-upgrade reports that everything is up to date.
apt-get install proxmox-ve reports the package as installed and up to date.
We migrated servers slowly over the past few weeks. The first three VMs were new and for testing. The next 7 VMs were created and installed from a base OS, and then we transferred data from the old ESXi VM to the new Proxmox VM. The system ran in this state for about 5 days without issue. The stability problem started once we transferred a Windows Server 2008R2 VM, which was not reinstalled. We performed a snapshot consolidation on the ESXi host, removed VMware Tools, and copied win2k8r2-flat.vmdk to our Proxmox host. We then created a new VM on the Proxmox system, and once the VM was created we used the command:
dd if=win2k8r2-flat.vmdk of=/dev/zvol/ssd_02/vm-111-disk-1
This copied the raw vmdk to our ZFS storage pool. The VM booted without issue, and we installed the balloon, virtio, and virt NIC drivers from virtio-win-0.1.112.iso. During the install of the balloon driver the VM stopped responding. At this point we thought the Win2k8r2 VM had crashed; in fact the entire Proxmox host had crashed. The Proxmox server at the remote location restarted, and we were able to log in and restart all the VMs. We installed the balloon driver without issue the second time, and everything appeared to be OK. We then changed the VM boot drive to SCSI and the SCSI controller to VIRTIO. We booted the VM, installed the SCSI VIRTIO driver, and performed an sdelete -z c: to reclaim free space lost in the migration. Again everything completed fine.
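For anyone repeating the copy step, here is a minimal sketch of the same dd transfer with an explicit block size and a verification pass afterwards. The function name copy_and_verify is hypothetical, and the sketch assumes GNU dd and cmp; it is not what we actually ran, only a way to rule out a bad copy.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): copy a raw image onto a target
# file or block device, then check that the target matches the source.
copy_and_verify() {
    src=$1
    dst=$2
    # Size of the source image in bytes (arithmetic expansion strips whitespace).
    size=$(($(wc -c < "$src"))) || return 1
    # bs=1M uses larger blocks than dd's 512-byte default.
    # conv=notrunc leaves the target's size alone; conv=fsync flushes data
    # to the target before dd exits.
    dd if="$src" of="$dst" bs=1M conv=notrunc,fsync 2>/dev/null || return 1
    # Compare only the first $size bytes, since the target (for example a
    # zvol) may be larger than the image.
    cmp -n "$size" "$src" "$dst"
}
```

With the names from this post, the call would be copy_and_verify win2k8r2-flat.vmdk /dev/zvol/ssd_02/vm-111-disk-1, and a non-zero exit status would mean the copy failed or the data did not match.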
However, within a few hours the Proxmox host crashed again. We disabled ballooning on the Win2k8r2 VM and set the memory to a fixed size. We still had stability problems with the Win2k8r2 VM (it continued to lock up every few hours), and then the Proxmox host crashed again. We switched the disk back to VIRTIO rather than SCSI VIRTIO and have tried pretty much every option we can find to diagnose the problem, but every 4-6 hours the Proxmox host would crash.
The problem appears completely random. The last crash occurred at 7:30 AM EST this morning, and we cannot find anything to help diagnose the issue. After this morning's crash we shut down the Win2k8R2 VM on Proxmox and resumed using it on the ESXi host. We are monitoring the Proxmox host to see if the problem continues with the Win2k8R2 VM shut down. So far no issues have occurred with any of the other VMs or the Proxmox host.
In /var/log/daemon.log we did find the following:
systemd[1]: Cannot add dependency job for unit watchdog-mux.socket, ignoring: Unit watchdog-mux.socket failed to load: No such file or directory.
The watchdog service (softdog) is reported as running, as checked with systemctl service watchdog-mux.
Is there any place we can look for clues as to what is triggering the host to crash? And what can we do to fix the error we are seeing?
The system is a Supermicro X10SLH-F with a Xeon E3-1246v3 CPU and 32 GB of Kingston ECC RAM. It passed stability testing prior to deployment (memtest86 and various other CPU and RAM stress utilities, started from a bootable USB key, not from within Proxmox).
Thanks