[SOLVED] Trying to install a new VM crashes the entire Proxmox cluster

Heracles31

New Member
Aug 4, 2024
Hi,

Running a 2-node + 1 QDevice cluster here. Both nodes are running 8.2.7 (no-subscription repo).

Code:
root@pmx-a:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-2-pve)
pve-manager: 8.2.7 (running version: 8.2.7/3e0176e6bb2ade3b)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-2
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.3
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.1
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.10
libpve-storage-perl: 8.2.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-4
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.2.0
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.13-2
pve-ha-manager: 4.0.5
pve-i18n: 3.2.3
pve-qemu-kvm: 9.0.2-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

I tried to install QRadar Community Edition in a VM on the first node yesterday. The ISO was downloaded from IBM and its SHA256 checksum has been validated, so no corruption there.
Code:
(sha256 cc8311deabc90762110d2e1ac978f2a67828077466b2900caa8566018a34d541)

During the install process (which starts by installing Red Hat), the entire Proxmox node (pmx-a) crashed and rebooted. The second node took over for a moment, but it too crashed shortly after the first, taking down my entire environment. I had to manually re-sync my MariaDB cluster, my StarWind vSAN ended up degraded too, ... After rebuilding everything, I re-tried today. Same result! Everything crashed again in exactly the same way.

1 - How can the deployment of a new VM on a Proxmox node crash that node?

2 - How can the crash of one Proxmox node in a cluster crash the other node?

Both nodes are FC630s in the very same Dell FX2s chassis. Each has an FD332 storage node with a total of 16 drives, an Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz and 192G of RAM. They are far from loaded: the Data Center summary shows CPU usage at 3%, memory at 28% and storage at 55%. pvecm status also shows that both nodes and the QDevice were online before the crash and came back online after the reboot.
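
For reference, this is roughly how I check the quorum state (trimmed to the lines I actually look at):
Code:
# Quorum overview: 2 nodes + 1 QDevice should give 3 expected votes
pvecm status | grep -E 'Expected votes|Total votes|Quorum|Flags'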

Corosync has triple link redundancy, the first link being a physically dedicated 1G direct cable (no switch).
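
For context, the three links look roughly like this in /etc/pve/corosync.conf (the addresses below are placeholders, not my real subnets):
Code:
nodelist {
  node {
    name: pmx-a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1    # dedicated 1G direct cable, no switch
    ring1_addr: 192.168.1.11  # placeholder: regular LAN
    ring2_addr: 192.168.2.11  # placeholder: third network
  }
  node {
    name: pmx-b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 192.168.1.12
    ring2_addr: 192.168.2.12
  }
}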

The QDevice is reached over a site-to-site VPN. That VPN is managed by pfSense, which itself runs as a VM on each node. pfSense-A is active by default and pfSense-B takes over in case of failure. When pmx-a crashes, pfSense-A goes down with it, and the failover takes some time to complete, so that may be why pmx-b is impacted? The failover takes too long, pmx-b loses quorum, and it would rather reboot itself than keep running?
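
If that is what happened, pmx-b's reboot would be the HA watchdog fencing the node after it lost quorum, rather than a real crash. This is what I am checking on pmx-b to confirm (these units are where I would expect fencing to show up, so treat the list as a guess):
Code:
# Previous boot of pmx-b: watchdog / HA / corosync messages around the reboot
journalctl -b -1 -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm -u corosync
# Current HA state: which node is master and where the resources are
ha-manager status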

When looking for clues, I get this:
Code:
root@pmx-a:~# journalctl -p err -f
Oct 28 18:12:21 pmx-a pvedaemon[3645]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:12:41 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:13:01 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:13:21 pmx-a pvedaemon[3645]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:13:41 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:14:01 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:14:21 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:14:41 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:15:01 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:15:13 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout

Indeed, VM 1060 is the one I am trying to install (QRadar CE). The thing is, these lines are from AFTER the crash.

Same thing if I look at the same info on node B: the errors are consequences of the crash, logged after the reboot. Nothing about what may have caused this in the first place.

When scrolling through journalctl for whatever happened, I only get this:
Code:
Oct 28 17:36:37 pmx-a pvedaemon[3638]: <root@pam> starting task UPID:pmx-a:000AA92C:0090BB11:672003E5:qmstart:1060:root@pam:
Oct 28 17:36:37 pmx-a pvedaemon[698668]: start VM 1060: UPID:pmx-a:000AA92C:0090BB11:672003E5:qmstart:1060:root@pam:
Oct 28 17:36:37 pmx-a systemd[1]: Started 1060.scope.
Oct 28 17:36:38 pmx-a kernel: tap1060i0: entered promiscuous mode
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered disabled state
Oct 28 17:36:38 pmx-a kernel: fwpr1060p0: entered allmulticast mode
Oct 28 17:36:38 pmx-a kernel: fwpr1060p0: entered promiscuous mode
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered forwarding state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered disabled state
Oct 28 17:36:38 pmx-a kernel: fwln1060i0: entered allmulticast mode
Oct 28 17:36:38 pmx-a kernel: fwln1060i0: entered promiscuous mode
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered forwarding state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered disabled state
Oct 28 17:36:38 pmx-a kernel: tap1060i0: entered allmulticast mode
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered forwarding state
Oct 28 17:36:38 pmx-a pvedaemon[3638]: <root@pam> end task UPID:pmx-a:000AA92C:0090BB11:672003E5:qmstart:1060:root@pam: OK
Oct 28 17:36:38 pmx-a pvedaemon[698846]: starting vnc proxy UPID:pmx-a:000AA9DE:0090BBB4:672003E6:vncproxy:1060:root@pam:
Oct 28 17:36:38 pmx-a pvedaemon[3637]: <root@pam> starting task UPID:pmx-a:000AA9DE:0090BBB4:672003E6:vncproxy:1060:root@pam:
Oct 28 17:36:38 pmx-a pveproxy[690266]: proxy detected vanished client connection
Oct 28 17:36:39 pmx-a pvedaemon[698849]: starting vnc proxy UPID:pmx-a:000AA9E1:0090BBBE:672003E7:vncproxy:1060:root@pam:
Oct 28 17:36:39 pmx-a pvedaemon[3637]: <root@pam> starting task UPID:pmx-a:000AA9E1:0090BBBE:672003E7:vncproxy:1060:root@pam:
Oct 28 17:36:48 pmx-a pvedaemon[698846]: connection timed out
Oct 28 17:36:48 pmx-a pvedaemon[3637]: <root@pam> end task UPID:pmx-a:000AA9DE:0090BBB4:672003E6:vncproxy:1060:root@pam: connection timed out
Oct 28 17:38:02 pmx-a pvedaemon[699528]: starting vnc proxy UPID:pmx-a:000AAC88:0090DC48:6720043A:vncproxy:1060:root@pam:
Oct 28 17:38:02 pmx-a pvedaemon[3638]: <root@pam> starting task UPID:pmx-a:000AAC88:0090DC48:6720043A:vncproxy:1060:root@pam:
-- Boot 46250a42d2ce4c3f926b713a17d62f61 --
Oct 28 17:49:57 pmx-a kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Oct 28 17:49:57 pmx-a kernel: Command line: BOOT_IMAGE=/vmlinuz-6.8.12-2-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet

Any idea what is happening? Any idea where I should look for more info?
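
For the next attempt I will try to capture the panic itself, since the journal simply stops right before the reboot. A rough sketch of what I have in mind (kdump-tools on a PVE host is an assumption on my part, I have not verified it there yet):
Code:
# List recorded boots, then look at the tail of the boot that crashed
journalctl --list-boots
journalctl -b -1 -p warning -n 200

# Keep the panic message on screen for a while instead of rebooting instantly
echo 'kernel.panic = 60' > /etc/sysctl.d/90-panic.conf
sysctl --system

# Optionally capture a kernel crash dump (standard Debian package, untested here)
apt install kdump-tools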

Thanks for your help,
 
I am trying to install that VM with a pretty large drive... 325G. The local ZFS store is big enough for that though (2 TB; 455G used)...
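
For what it is worth, this is how I checked the free space before blaming the disk size ("rpool" here is just the pool from my boot line; the VM store may live on a different pool in your setup):
Code:
# Pool-level and dataset-level view of the local ZFS space
zpool list rpool
zfs list -o name,used,avail,refer rpool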
 
Ok... So I just tried a third time and got another crash.

This time, I reduced the hard drive from 350 to 275.
I also changed the CPU type from x86-64-v2-AES to host.

The difference is that node B did not crash this time; only node A did.

I think that is because my HA storage was not in an optimal state when A crashed, so B ended up without access to its HA storage. That is why it crashed itself right after A.

I keep searching for how on earth the install of a Red Hat VM in Proxmox can crash an entire host, but really, any help on how to debug / diagnose this would be appreciated...
 
Ok... Got it running now...

It seems to be a bug specific to Red Hat and to distributions related to it (CentOS, Alma, ...). On the forum, there are other people experiencing problems with it, even though in their case it did not end in Proxmox crashes.

The first thing they changed was setting the CPU type to host. I already had that.
I also found notes about ACPI and NUMA. ACPI was enabled and NUMA was not. I disabled ACPI and enabled NUMA, and that allowed me to install from the ISO without any crash.

So if you are trying to run a Red Hat distribution or one related to it, you may want to try these settings.
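
For anyone hitting the same thing, the equivalent CLI changes would be something like this (1060 is my VMID, adjust to yours; the VM needs a full stop/start for them to apply):
Code:
# CPU type "host", ACPI off, NUMA on
qm set 1060 --cpu host --acpi 0 --numa 1
qm stop 1060
qm start 1060
# Double-check the resulting config
qm config 1060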
 