Migrated/virtualized VMs crashing

hellfire

Well-Known Member
Aug 17, 2016
Hi,

I have been running a Proxmox VE cluster for quite some time now, and it is stable for newly installed (QEMU) VMs. For (Linux) VMs that were migrated, however, it is not. The migrated VMs come either from dedicated servers or from Citrix XenServer HVM; both types cause the problem. The VMs in question simply get stuck from time to time: no reaction any more, and the console stays blank. The only remedy is to kill the VM and start it again.

Across ~50 such VMs this happens 1-5 times per day. The restart is triggered automatically by our monitoring system when it detects the hang.
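Roughly, the liveness check behind that can be sketched like this (a minimal sketch only; the probe command is passed in as arguments so the helper can be exercised without a cluster - on a PVE node the probe would be something like `qm agent VMID ping`, since the guest agent is enabled, but the qm parts here are assumptions):

```shell
#!/bin/sh
# Minimal liveness-probe sketch. On a real PVE node the probe would be
# something like: qm agent "$vmid" ping   (the guest agent is enabled here).
# vm_alive VMID PROBE-COMMAND... -> exit status 0 if the probe succeeds
vm_alive() {
    vmid="$1"; shift
    "$@" >/dev/null 2>&1
}

# Example reaction (assumed commands, for illustration only):
# vm_alive 125 qm agent 125 ping || { qm stop 125; qm start 125; }
```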

The VMs get stuck on any node - it makes no difference whether I move them from one node to another.

This is the software versioning on all the servers:

Code:
ii  pve-cluster                          6.1-8                           amd64        "pmxcfs" distributed cluster filesystem for Proxmox Virtual Environment.
ii  pve-container                        3.1-12                          all          Proxmox VE Container management tool
ii  pve-docs                             6.2-5                           all          Proxmox VE Documentation
ii  pve-edk2-firmware                    2.20200531-1                    all          edk2 based firmware modules for virtual machines
ii  pve-firewall                         4.1-2                           amd64        Proxmox VE Firewall
ii  pve-firmware                         3.1-1                           all          Binary firmware code for the pve-kernel
ii  pve-ha-manager                       3.0-9                           amd64        Proxmox VE HA Manager
ii  pve-i18n                             2.1-3                           all          Internationalization support for Proxmox VE
ii  pve-kernel-5.4                       6.2-4                           all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.4.44-1-pve              5.4.44-1                        amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.4.44-2-pve              5.4.44-2                        amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-helper                    6.2-4                           all          Function for various kernel maintenance tasks.
ii  pve-lxc-syscalld                     0.9.1-1                         amd64        PVE LXC syscall daemon
ii  pve-manager                          6.2-10                          amd64        Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                         5.0.0-11                        amd64        Full virtualization on x86 hardware
ii  pve-xtermjs                          4.3.0-1                         all          HTML/JS Shell client
root@px02:~# uname -a
Linux px02 5.4.44-2-pve #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) x86_64 GNU/Linux

Is there anything I can verify or check to find the cause of the issue?

Regards,
h.
 
hi,

sounds weird for sure. could you post an example configuration of such a VM? (qm config VMID, where VMID is the ID of the VM)

also please post the full output of pveversion -v so we can see all Proxmox package versions.

do you see anything interesting in the logs from PVE (task logs, syslog, journalctl, dmesg, etc.) or from your monitoring system while such a hang occurs?
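for correlating, something like this can help to cut a window out of the syslog around the reported hang time (just a sketch; it assumes classic "Mon DD HH:MM:SS" syslog timestamps - on a systemd host, journalctl --since/--until does the same thing directly):

```shell
# syslog_window FILE "Mon DD" FROM TO
# Prints the lines of FILE whose day prefix matches and whose time field
# lies in [FROM, TO]. Plain string comparison works because HH:MM:SS
# sorts lexically.
syslog_window() {
    awk -v day="$2" -v from="$3" -v to="$4" \
        'index($0, day) == 1 && $3 >= from && $3 <= to' "$1"
}

# example: syslog_window /var/log/syslog "Sep 29" 13:20:00 13:30:00
```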
 
Version details

Code:
root@px01:~# pveversion -v  
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

VM example config

Code:
# qm config 125

agent: 1
bootdisk: scsi0
cores: 4
cpuunits: 256
ide2: none,media=cdrom
memory: 4096
name: myvm1.domain.tld
net0: e1000=56:77:5C:A2:9B:88,bridge=vmbr1,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: localdisk:125/vm-125-disk-0.qcow2,cache=unsafe,discard=on,format=qcow2,size=20G
smbios1: uuid=bd928d92-3ffe-4fbc-a714-445f79313b99
sockets: 1
vmgenid: e243fb83-f086-477a-a8a9-e4e0a564615c
 
The last restart was today at 13:27, for VM ID 125:

syslog

Code:
Sep 29 13:25:28 px01 postfix/smtpd[27870]: warning: hostname sever54.centerandpark.net does not resolve to address 103.253.42.54: Name or service not known
Sep 29 13:25:28 px01 postfix/smtpd[27870]: connect from unknown[103.253.42.54]
Sep 29 13:25:29 px01 postfix/smtpd[27870]: disconnect from unknown[103.253.42.54] ehlo=1 auth=0/1 quit=1 commands=2/3
Sep 29 13:26:00 px01 systemd[1]: Starting Proxmox VE replication runner...
Sep 29 13:26:01 px01 systemd[1]: pvesr.service: Succeeded.
Sep 29 13:26:01 px01 systemd[1]: Started Proxmox VE replication runner.
Sep 29 13:27:00 px01 systemd[1]: Starting Proxmox VE replication runner...
Sep 29 13:27:01 px01 systemd[1]: pvesr.service: Succeeded.
Sep 29 13:27:01 px01 systemd[1]: Started Proxmox VE replication runner.
Sep 29 13:27:58 px01 pvedaemon[28061]: <root@pam> successful auth for user 'root@pam'
Sep 29 13:28:00 px01 systemd[1]: Starting Proxmox VE replication runner...
Sep 29 13:28:01 px01 systemd[1]: pvesr.service: Succeeded.
Sep 29 13:28:01 px01 systemd[1]: Started Proxmox VE replication runner.
Sep 29 13:28:03 px01 pvedaemon[4387]: shutdown VM 125: UPID:px01:00001123:1462DF69:5F731A43:qmshutdown:125:root@pam:
Sep 29 13:28:03 px01 pvedaemon[9565]: <root@pam> starting task UPID:px01:00001123:1462DF69:5F731A43:qmshutdown:125:root@pam:
 
I checked dmesg on two servers (both have failing VMs):

server 1 (older one):
  • occasional corrected RAM errors like these:
    Code:
    [So Sep 27 07:44:45 2020] EDAC MC0: 1 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
    [So Sep 27 15:25:01 2020] EDAC MC0: 1 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
    [So Sep 27 16:49:30 2020] EDAC MC0: 1 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
    [Mo Sep 28 04:11:10 2020] EDAC MC0: 1 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
    [Mo Sep 28 15:24:57 2020] EDAC MC0: 2 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
    [Mo Sep 28 16:49:27 2020] EDAC MC0: 1 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
    [Di Sep 29 04:11:52 2020] EDAC MC0: 1 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
    [Di Sep 29 09:41:32 2020] EDAC MC0: 1 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
  • occasional hardware errors like these:
    Code:
    [So Sep 27 15:25:01 2020] mce: [Hardware Error]: Machine check events logged
    [So Sep 27 16:49:30 2020] mce: [Hardware Error]: Machine check events logged
    [Mo Sep 28 04:11:10 2020] mce: [Hardware Error]: Machine check events logged
    [Mo Sep 28 15:24:57 2020] mce: [Hardware Error]: Machine check events logged
    [Mo Sep 28 16:49:27 2020] mce: [Hardware Error]: Machine check events logged
    [Di Sep 29 04:11:52 2020] mce: [Hardware Error]: Machine check events logged
    [Di Sep 29 09:41:32 2020] mce: [Hardware Error]: Machine check events logged
server 2 (newer model):
  • none of the symptoms of server 1
I have installed rasdaemon (the successor to mcelog) to get details on the log lines with "Hardware Error". (Data not yet available.)

The times of the above errors do not coincide with the times when the VMs become unavailable.

Attached there is a hardware overview of both servers.

Update: Since the timestamps of the "Hardware Error" lines and the EDAC/RAM CE errors are identical, I assume the hardware error refers to the corrected memory error.
 

Attachments

  • server-1.txt (12.1 KB)
  • server-2.txt (10.3 KB)
Hmm... it seems replacing RAM module mc0/dimm6 is a good idea (481 errors in ~40 days):

Code:
root@px01:~# grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm*ce_count

/sys/devices/system/edac/mc/mc0/dimm0/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/dimm1/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/dimm2/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/dimm3/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/dimm4/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/dimm5/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/dimm6/dimm_ce_count:481
/sys/devices/system/edac/mc/mc0/dimm7/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/dimm8/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm0/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm1/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm2/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm3/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm4/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm5/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm6/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm7/dimm_ce_count:0
/sys/devices/system/edac/mc/mc1/dimm8/dimm_ce_count:0

Since server 2 does not report those issues and VMs are failing on that server as well, I don't think this is the cause - but I'll replace that module for sure.
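To keep an eye on this after the swap, the per-DIMM counters can be summed with a small helper (a sketch; the sysfs root is a parameter so it can be pointed at a test directory instead of the live system):

```shell
# Sum dimm_ce_count over all memory controllers below an EDAC sysfs root
# (default: /sys/devices/system/edac/mc). Prints DIMMs with nonzero
# corrected-error counts, then the total.
edac_ce_total() {
    root="${1:-/sys/devices/system/edac/mc}"
    total=0
    for f in "$root"/mc*/dimm*/dimm_ce_count; do
        [ -r "$f" ] || continue
        n=$(cat "$f")
        if [ "$n" -gt 0 ]; then
            echo "$f: $n"
        fi
        total=$((total + n))
    done
    echo "total: $total"
}
```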
 
