Trying to install a new VM crashes the entire Proxmox cluster

Heracles31

Hi,

Running a 2-node + 1 QDevice cluster here. Both nodes run 8.2.7 (no-subscription repo).

Code:
root@pmx-a:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-2-pve)
pve-manager: 8.2.7 (running version: 8.2.7/3e0176e6bb2ade3b)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-2
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.3
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.1
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.10
libpve-storage-perl: 8.2.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-4
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.2.0
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.13-2
pve-ha-manager: 4.0.5
pve-i18n: 3.2.3
pve-qemu-kvm: 9.0.2-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

I tried to install QRadar Community Edition in a VM on the first node yesterday. The ISO was downloaded from IBM and its SHA256 checksum has been validated, so no corruption there.
Code:
(sha256 cc8311deabc90762110d2e1ac978f2a67828077466b2900caa8566018a34d541)
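
For reference, I verified it roughly like this (path and filename are placeholders for where I stored the ISO):
Code:
# output must match the checksum published by IBM
sha256sum /var/lib/vz/template/iso/QRadarCE.iso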

During the install process (which starts by installing Red Hat), the entire Proxmox node (pmx-a) crashed and rebooted. The second node took over for a moment, but it too crashed shortly after the first one, taking down my entire environment. I had to manually re-sync my MariaDB cluster, my StarWind vSAN ended up degraded too, etc. After rebuilding everything, I retried today. Same result: everything crashed again in exactly the same way.

1. How can the deployment of a new VM on a Proxmox node crash that node?

2. How can the crash of one Proxmox node in a cluster crash the other node?

Both nodes are FC630s in the very same Dell FX2S chassis. Each has an FD332 storage node with a total of 16 drives, an Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, and 192G of RAM. They are far from loaded: the Datacenter Summary shows CPU usage at 3%, memory at 28%, and storage at 55%. pvecm status also shows that, both before the crash and after the reboot, both nodes and the QDevice are online.

Corosync has triple redundancy, the first link being a physically dedicated 1G direct cable (no switch).
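
For context, the link layout in /etc/pve/corosync.conf looks roughly like this (IPs below are placeholders, not my real addressing):
Code:
nodelist {
  node {
    name: pmx-a
    nodeid: 1
    quorum_votes: 1
    # ring0 = dedicated 1G direct cable (no switch), ring1/ring2 = fallback networks
    ring0_addr: 10.10.10.1
    ring1_addr: 192.168.10.11
    ring2_addr: 192.168.20.11
  }
  # pmx-b has matching ring0/1/2 addresses
}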

About the QDevice: it is reached over a site-to-site VPN. That VPN is managed by pfSense, which itself runs as a VM on each node. pfSense-A is active by default and pfSense-B takes over in case of failure. When pmx-a crashes, pfSense-A goes down with it. The failover takes some time to complete, so maybe that is why pmx-b is impacted? The failover takes too long, pmx-b loses quorum, and it crashes itself rather than keep running?
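
To test that theory, I plan to check whether quorum loss and HA/watchdog fencing show up on pmx-b in the previous boot (these are just the standard tools on a stock PVE install):
Code:
# is HA actually managing resources? (watchdog fencing only kicks in when HA is active)
ha-manager status
# corosync / HA / watchdog messages from the boot that ended in the crash
journalctl -b -1 -u corosync -u pve-ha-lrm -u watchdog-mux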

When looking for clues, I get this:
Code:
root@pmx-a:~# journalctl -p err -f
Oct 28 18:12:21 pmx-a pvedaemon[3645]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:12:41 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:13:01 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:13:21 pmx-a pvedaemon[3645]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:13:41 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:14:01 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:14:21 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:14:41 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:15:01 pmx-a pvedaemon[3644]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout
Oct 28 18:15:13 pmx-a pvedaemon[3643]: VM 1060 qmp command failed - VM 1060 qmp command 'guest-ping' failed - got timeout

Indeed, VM 1060 is the one I am trying to install (QRadar CE). The thing is, these lines are from AFTER the crash.

Same thing if I look at node B: the errors are consequences of the crash, logged after the reboot. Nothing about what may have caused this in the first place.
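
Since the journal apparently never captures the moment of the crash, my plan is to look at the tail of the previous boot and to set up kdump so a kernel panic would leave a dump behind (kdump-tools is the standard Debian package; I have not verified yet how well it behaves on a ZFS root):
Code:
# tail of the boot that ended in the crash, in case something made it to disk
journalctl -b -1 -e
# capture kernel panics to a crash dump on the next occurrence
apt install kdump-tools
kdump-config show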

When scrolling through journalctl for whatever happened, I only get this:
Code:
Oct 28 17:36:37 pmx-a pvedaemon[3638]: <root@pam> starting task UPID:pmx-a:000AA92C:0090BB11:672003E5:qmstart:1060:root@pam:
Oct 28 17:36:37 pmx-a pvedaemon[698668]: start VM 1060: UPID:pmx-a:000AA92C:0090BB11:672003E5:qmstart:1060:root@pam:
Oct 28 17:36:37 pmx-a systemd[1]: Started 1060.scope.
Oct 28 17:36:38 pmx-a kernel: tap1060i0: entered promiscuous mode
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered disabled state
Oct 28 17:36:38 pmx-a kernel: fwpr1060p0: entered allmulticast mode
Oct 28 17:36:38 pmx-a kernel: fwpr1060p0: entered promiscuous mode
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: Infra: port 3(fwpr1060p0) entered forwarding state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered disabled state
Oct 28 17:36:38 pmx-a kernel: fwln1060i0: entered allmulticast mode
Oct 28 17:36:38 pmx-a kernel: fwln1060i0: entered promiscuous mode
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 1(fwln1060i0) entered forwarding state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered disabled state
Oct 28 17:36:38 pmx-a kernel: tap1060i0: entered allmulticast mode
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered blocking state
Oct 28 17:36:38 pmx-a kernel: fwbr1060i0: port 2(tap1060i0) entered forwarding state
Oct 28 17:36:38 pmx-a pvedaemon[3638]: <root@pam> end task UPID:pmx-a:000AA92C:0090BB11:672003E5:qmstart:1060:root@pam: OK
Oct 28 17:36:38 pmx-a pvedaemon[698846]: starting vnc proxy UPID:pmx-a:000AA9DE:0090BBB4:672003E6:vncproxy:1060:root@pam:
Oct 28 17:36:38 pmx-a pvedaemon[3637]: <root@pam> starting task UPID:pmx-a:000AA9DE:0090BBB4:672003E6:vncproxy:1060:root@pam:
Oct 28 17:36:38 pmx-a pveproxy[690266]: proxy detected vanished client connection
Oct 28 17:36:39 pmx-a pvedaemon[698849]: starting vnc proxy UPID:pmx-a:000AA9E1:0090BBBE:672003E7:vncproxy:1060:root@pam:
Oct 28 17:36:39 pmx-a pvedaemon[3637]: <root@pam> starting task UPID:pmx-a:000AA9E1:0090BBBE:672003E7:vncproxy:1060:root@pam:
Oct 28 17:36:48 pmx-a pvedaemon[698846]: connection timed out
Oct 28 17:36:48 pmx-a pvedaemon[3637]: <root@pam> end task UPID:pmx-a:000AA9DE:0090BBB4:672003E6:vncproxy:1060:root@pam: connection timed out
Oct 28 17:38:02 pmx-a pvedaemon[699528]: starting vnc proxy UPID:pmx-a:000AAC88:0090DC48:6720043A:vncproxy:1060:root@pam:
Oct 28 17:38:02 pmx-a pvedaemon[3638]: <root@pam> starting task UPID:pmx-a:000AAC88:0090DC48:6720043A:vncproxy:1060:root@pam:
-- Boot 46250a42d2ce4c3f926b713a17d62f61 --
Oct 28 17:49:57 pmx-a kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Oct 28 17:49:57 pmx-a kernel: Command line: BOOT_IMAGE=/vmlinuz-6.8.12-2-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet

Any idea what is happening? Any idea where I should look for more info?

Thanks for your help,
 
I am trying to install that VM with a pretty large drive: 325G. The local ZFS store is big enough for that, though (2TB; 455G used).
 
OK... so I just tried a third time and got another crash.

This time, I reduced the hard drive from 350G to 275G.
I also changed the CPU type from x86-64-v2-AES to host.
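
For the record, the CLI equivalents of what I changed through the GUI would be roughly this (storage name and disk slot are placeholders):
Code:
# switch the vCPU type from x86-64-v2-AES to host
qm set 1060 --cpu host
# recreate the disk at 275G instead of 350G (allocates a new volume on the given storage)
qm set 1060 --scsi0 local-zfs:275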

The difference is that node B did not crash; only node A did.

I think that is because my HA storage was not in an optimal state when A crashed, so B ended up without access to its HA storage. That would be why it crashed itself right after A.

I keep searching for how on earth the install of a Red Hat VM in Proxmox can crash an entire host, but really, any help on how to debug/diagnose this would be appreciated...
 
