Our server crashed in production while live migrating.

Hello.
One of our servers just rebooted while we were live-migrating a VM in production.
What logs do I need to grab ASAP, before they are overwritten?

My first guess is that it has to do with HA. The VM and the three existing servers had HA configured.
The new machine we were migrating to was not yet in the HA group.
Could that be why it restarted?
 
Do you have the corosync network separated from the migration network?
Please post the output of 'pveversion -v' and the syslog (/var/log/syslog[.X]) or journal (journalctl --since "2019-09-05 00:00:00" or a little bit before the migration started, requires persistent journal) from that time.
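If you want to preserve the logs before logrotate overwrites anything, something along these lines works (the target directory below is just an example):

mkdir -p /root/crash-logs
# copy the current and most recent rotated syslog
cp /var/log/syslog /var/log/syslog.1 /root/crash-logs/
# dump the journal from around the migration window (needs persistent journal)
journalctl --since "2019-09-05 00:00:00" > /root/crash-logs/journal.txt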
 
root@proxmox1:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-8
pve-kernel-5.0.21-1-pve: 5.0.21-1
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

https://pastebin.com/LUqv8w5S
 
The new server was updated to pve-cluster 6.0-7,
while the rest of the cluster was running pve-cluster 6.0-5.

The new server was also running pve-kernel 5.0.21-2, while the others were running 5.0.21-1.

I don't know if that has something to do with it.
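For anyone wanting to check their own cluster: the versions can be compared across nodes with something like this (node names besides proxmox1 are placeholders for ours):

for node in proxmox1 proxmox2 proxmox3 proxmox4; do
    ssh root@$node "hostname; pveversion -v | grep -E 'pve-cluster|pve-kernel'"
done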
 
I thought maybe it was because the new server was not in the HA group, but when I tried to recreate the crash in a virtual environment, it did not crash.
Are HA groups necessary? How does Proxmox determine which node is best?
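For reference, our group is defined roughly like this in /etc/pve/ha/groups.cfg (the group name and priority values below are made up for illustration); as far as I understand, the HA manager prefers the available node with the highest priority:

group: prodgroup
        nodes proxmox1:2,proxmox2:2,proxmox3:1
        restricted 0
        nofailback 0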
 
Whenever I upgrade the PVE nodes, I manually move the live VMs onto another node and then upgrade the empty node. I know some people just upgrade and reboot and let HA handle the migrations, but I only let HA handle it if the node actually failed unexpectedly. This way I make sure the migrations are 100% successful.
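In practice it's just something like this before each upgrade (the VM IDs and target node are examples, adjust to your cluster):

# live-migrate the running VMs off the node that will be upgraded
qm migrate 100 proxmox2 --online
qm migrate 101 proxmox2 --online
# once the node is empty, upgrade and reboot it
apt update && apt dist-upgrade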
 
Oh no no, I was manually migrating the VMs when it crashed. I was trying to recreate the scenario to check if we did anything wrong. :)
I thought initially that the crash was caused by migrating a VM outside its HA group. But that was not the case.
 
