Our server crashed in production while live migrating.

potetpro · Sep 6, 2019

Hello.
One of our servers just rebooted while we were live migrating a VM in production.
Which logs do I need to grab ASAP, before they are overwritten?

My first guess is that it has to do with HA. The VM and the 3 existing servers had HA configured.
The new machine that we were migrating to was not yet in the HA group.
Could that be why it restarted?

mira · Sep 6, 2019

Do you have the corosync network separated from the migration network?
Please post the output of 'pveversion -v' and the syslog (/var/log/syslog[.X]) or journal (journalctl --since "2019-09-05 00:00:00" or a little bit before the migration started, requires persistent journal) from that time.
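
For narrowing the journal down to the time of the migration, a minimal sketch that builds a `journalctl` command for a window around the event might help. The function name `journalctl_window`, the window sizes, and the example timestamp are assumptions made for illustration; only `--since`, `--until` and `--no-pager` are standard journalctl options. As noted above, entries from before the reboot are only available if the journal is persistent.

```python
from datetime import datetime, timedelta
import shlex


def journalctl_window(event_time, before_minutes=30, after_minutes=10):
    """Return a journalctl argument list covering a window around event_time.

    The default window sizes are arbitrary assumptions; widen them as needed.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    since = event_time - timedelta(minutes=before_minutes)
    until = event_time + timedelta(minutes=after_minutes)
    return [
        "journalctl",
        "--since", since.strftime(fmt),
        "--until", until.strftime(fmt),
        "--no-pager",
    ]


if __name__ == "__main__":
    # Hypothetical migration start time; replace with the real one.
    cmd = journalctl_window(datetime(2019, 9, 6, 10, 0, 0))
    # Print a shell-safe command line that can be pasted on the node.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Redirecting the printed command's output to a file on the node gives a copy that survives log rotation.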

potetpro · Sep 6, 2019

root@proxmox1:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-8
pve-kernel-5.0.21-1-pve: 5.0.21-1
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

https://pastebin.com/LUqv8w5S

potetpro · Sep 6, 2019

The new server had been updated to pve-cluster 6.0.7,
while the rest of the cluster was running pve-cluster 6.0.5.

The new server was also running pve-kernel-5.0.21-2, while the others were running 5.0.21-1.

Dunno if that has something to do with it.
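
To see every package that differs between two nodes, not just the ones spotted by eye, the `pveversion -v` output of both nodes can be compared. This is only a rough sketch: `parse_pveversion` and `diff_versions` are hypothetical helper names, and the sample text for the new node is illustrative (written in the `6.0-7` notation that `pveversion` prints), not real output from that machine.

```python
def parse_pveversion(text):
    """Parse 'pveversion -v' output into a {package: version} dict."""
    versions = {}
    for line in text.splitlines():
        # Package lines look like "name: version (optional note)".
        name, sep, rest = line.partition(": ")
        if not sep or not rest.strip():
            continue  # skip shell prompts and blank lines
        # Drop a trailing note such as "(running kernel: ...)".
        versions[name.strip()] = rest.split(" (")[0].strip()
    return versions


def diff_versions(a, b):
    """Return {package: (version_in_a, version_in_b)} for all mismatches."""
    return {
        pkg: (a.get(pkg), b.get(pkg))
        for pkg in sorted(set(a) | set(b))
        if a.get(pkg) != b.get(pkg)
    }


if __name__ == "__main__":
    old_node = "proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)\npve-cluster: 6.0-5\n"
    # Illustrative text only, standing in for the new node's output.
    new_node = "proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)\npve-cluster: 6.0-7\n"
    mismatches = diff_versions(parse_pveversion(old_node), parse_pveversion(new_node))
    for pkg, (old, new) in mismatches.items():
        print(f"{pkg}: {old} -> {new}")
```

A package missing on one side shows up with `None` as its version, which also makes partially upgraded nodes easy to spot.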

potetpro · Sep 9, 2019

I thought maybe it was because the new server was not in the HA group, but when I tried to recreate the crash in a virtual environment, it did not crash.
Are HA groups necessary? How does Proxmox determine which node is the best?

NoahD · Sep 9, 2019

Whenever I upgrade the PVE nodes I manually move the live VMs onto another node and then upgrade the empty node. I know some just upgrade and reboot to let HA handle the migrations but I only let HA handle it if the node actually failed unexpectedly. This way I make sure the migrations are 100% successful.

potetpro · Sep 9, 2019

Oh no no, I was manually migrating the VMs when it crashed. I was trying to recreate the scenario to check if we did anything wrong.

I thought initially that the crash was caused by migrating a VM outside its HA-group. But that was not the case.
