Proxmox Host Crashing - Ideas?

MyThoughts · Jan 19, 2016

We are in the process of migrating our last ESXi based host to Proxmox. The problem is the Proxmox host is crashing and rebooting. There are no errors in the log and the system is located at a remote server location (not hosted, our hardware and location). We have been running this system in a lab environment seemingly without issue. It was originally installed with Proxmox 3.x and upgraded to v4.x

Current output of pveversion -v is:
proxmox-ve: 4.1-33 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-33
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-30
qemu-server: 4.0-46
pve-firmware: 1.1-7
libpve-common-perl: 4.0-43
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-21
pve-container: 1.0-37
pve-firewall: 2.0-15
pve-ha-manager: 1.0-18
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie

apt-get update && apt-get dist-upgrade lists that everything is up to date.
apt-get install proxmox-ve lists installed and up to date.

We migrated servers slowly over the past few weeks. The first three VMs were new and for testing, the next 7 VMs were created and installed from base OS, and then we transferred data from old ESXi VM to new Proxmox VM. The system ran in this state for about 5 days without issue. The stability problem started once we transferred a Windows Server 2008R2 VM, this was not reinstalled. We preformed a snapshot consolidation on the ESXi, removed VMware tools, copied the win2k8r2-flat.vmdk to our Proxmox. We created a new VM on our Proxmox system, once the VM was created we used the command:
dd if=win2k8r2-flat.vmdk of=/dev/zvol/ssd_02/vm-111-disk-1

This moved the raw vmdk to our ZFS storage pool. The VM booted without issue, we installed the balloon, virtio, and virt nic drivers from the virtio-win-0.1.112.iso. During the install of the balloon driver, the VM stopped responding, at this point we thought the Win2k8r2 had crashed. In fact the entire Proxmox host crashed. The Promox server at the remote location restarted and we were able to login and restart all the VMs, we installed the balloon driver without issue the second time. Everything appeared to be ok. We then changed the VM IO boot drive to SCSI and SCSI controller to VIRTIO. We booted the VM up installed the SCSI VIRTIO driver and preformed an sdelete -z c: to reclaim lost free space in the migration. Again everything completed fine.
However within a few hours the Proxmox host crashed again. We disabled balloon on the Win2k8r2 VM and set the memory to a fixed size. Still we had stability problems with the Wik28r2 (it continued to lock up every few hours), then the Proxmox host crashed again. We switched back to the VIRTIO as opposed to SCSI VIRTIO, and have tried pretty much every option we can find to diagnose the problem but every 4-6 hours the Proxmox Host would crash.

The problem appears completely random, the last crash occurred at 7:30AM EST this morning and we cannot find anything to help diagnose the issue. After this morning's crash We have shutdown the Win2k8R2 VM on Promox and have resumed using it on the ESXi. We are monitoring the Proxmox host to see if the problem continues with the Wik2k8R2 VM shutdown. So far no issues have occured with any of the other VMs or the Proxmox host.

In the /var/log/daemon.log we did find following:
systemd[1]: Cannot add dependency job for unit watchdog-mux.socket, ignoring: Unit watchdog-mux.socket failed to load: No such file or directory. The watchdog service (softdog) is reported as running as checked with systemctl service watchdog-mux.

Is there any place we can look for clues as to what is triggering the host to crash? and what can we do to get this error we are seeing fixed?

System is a Supermicro X10SLH-F with a Xeon E3-1246v3 CPU and 32 GB of Kingston ECC ram. System passed stability testing prior to deployment (memtest86, and various other CPU and ram stress utilities we started from a bootable usb key, not from within Proxmox).

Thanks

MyThoughts · Jan 19, 2016

Well just a quick update, the Proxmox Host crashed and rebooting itself a few minutes ago. Unfortunately that means I have to rule out the Win2k8R2 as the problem.

RobFantini · Jan 20, 2016

have you checked /var/log/syslog lines before reboot?

fireon · Jan 20, 2016

Badly problem. That can be a lot of things. But let us start to see in the Logs, when the server crashed. You can use journalctl to search in log entries. With the following command you can see all logs from an specific boot.

journalctl --list-boots

then you can use an bootentry from the list like this:

journalctl -b 67810be3ff5f4047b38d207b26bc18b7 -r

You should see something interessting things. Look at the logtimes.

MyThoughts · Jan 20, 2016

We had looked at every log we could think of that may contain info, syslog, daemon.log, messages, kern.log.
They log simply showed everything working on one line and the next was a line indicating it was booting.

As the system has IPMI capability we decided to record the console, hoping to catch some sort of kernel panic error. Unfortunately we didn't catch anything, one second the unit was running the next the bios screen popped up indicating a reboot.

We have power monitors so we can see live power draw at the wall. We know power was not an issue as the IPMI console does not disconnect, and the power draw at the wall doesn't really change.

This is our forth ESXI to Proxmox migration, and the first that we are having problems with. We went back and made a list of what could be unique about this migration. We do have one unique aspect, the ESXi host had a VM with pcie pass through of 4 network cards. This is a requirement as the VM is running DPI as a remote network routing 'hub.' We did configure a VM on the Proxmox with the same setup (4 x pcie passthrough of nics), it was one of the first VMs setup on this machine and it ran without issue for ~6d without problems.

Through much testing and diagnosis we have been able to isolate the problem to the point where we can reproduce it on demand.

This particular host has 32G of ram, in order to test memory over commit we shutdown all of our VMs and created a new VM with no HD, attached a memtest86+ v5.01 ISO and configured it with 32G of ram. Sure enough the VM boots without issue and begins testing ram, monitoring the host via the IPMI console we can see all 32G of ram is allocated and ~2.5G of swap is used, in line with what we expected. We shutdown the VM, edited the vm conf with the following:

hostpci0: 04:00.0,pcie=1
hostpci1: 05:00.0,pcie=1
hostpci2: 06:00.0,pcie=1
hostpci3: 07:00.0,pcie=1

This would or should enable pcie passthrough for the nics. The VM would not start, and during the attempt to start would trigger our mysterious crash. Now if we lower the amount of ram for the VM it will boot. It appears to be a bug or limitation of the memory over commit in Proxmox whereas if the host system hits a physical ram limit and pcie pass though is enabled the entire host with crash.

The only thing we found was the following lines just before the system crashed when attempting to start the VM:

Use of uninitialized value $kvmver in pattern match (m//) at /usr/share/perl5/PVE/QemuServer.pm line 6409.
Use of uninitialized value $current_major in numeric ge (>=) at /usr/share/perl5/PVE/QemuServer.pm line 6415.

This is the easiest way to trigger the problem, however we were able to trigger is via other means. Using stress-ng to max the memory usage or starting many other VMs while the vm with pcie pass through is enabled.

I know, memory over commit is not recommended, and pci/pcie passthrough is experimental. We were just matching how our ESXi host was setup. Most of the VMs we are running use very little ram, so we have min/max set in the ballooning as 1G/8G on almost all the VMs. During normal operation physical ram has never reported as being 100% used, and swap file usage is always reported as 0.

Q-wulf · Jan 20, 2016

do you use zfs and your swap is on said ZFS ?
If so, check https://forum.proxmox.com/threads/zfs-swap-crashes-system.25208/#post-126215

MyThoughts said:
Use of uninitialized value $kvmver in pattern match (m//) at /usr/share/perl5/PVE/QemuServer.pm line 6409.
Use of uninitialized value $current_major in numeric ge (>=) at /usr/share/perl5/PVE/QemuServer.pm line 6415.

is totally unrelated, paging @spirit to confirm.

MyThoughts · Jan 21, 2016

Q-wulf said:
do you use zfs and your swap is on said ZFS ?
If so, check https://forum.proxmox.com/threads/zfs-swap-crashes-system.25208/#post-126215

https://forum.proxmox.com/threads/zfs-swap-crashes-system.25208/#post-126215

We do use ZFS on all our Proxmox hosts. We initially moved the swap off of the ZFS and to a separate partition on a different drive.
I just finished moving the swap back onto a ZFS zvol a few minutes ago, and will be closely monitoring the situation now that you pointed out the potential issue we could get from that.

So far keeping the VM that needs the pcie passthough off has resulted in a stable system.

Search

Search

Proxmox Host Crashing - Ideas?

MyThoughts

Active Member

MyThoughts

Active Member

RobFantini

Famous Member

fireon

Distinguished Member

MyThoughts

Active Member

Q-wulf

Well-Known Member

MyThoughts

Active Member