Debian VM randomly hangs on 'Loading Initial ramdisk' after node reboot/power on

Hello everyone, this is my first post!
First of all, let me thank you all: despite being a noob who works in a completely different field, I have learned a lot thanks to this forum.

Let me introduce my hardware setup: one machine, hostname 00-lazarus, with an Intel Pentium G3220 (Proxmox VE 8.8.2), and another machine, 01-ares, with an Intel Core i3-4150 (Proxmox VE 8.8.2). Both have 16 GB of RAM and a 512 GB SATA SSD, and they are identical apart from the CPUs: they are two old mini PCs a friend gifted to me, and I simply swapped an Intel Core i3-4150 into one of them.
Despite the names, they are not part of the same cluster, or of any cluster: I like them independent and there is nothing for me to gain in clustering them.

> Note: 00-lazarus was part of a "testing cluster" in the past, but I properly detached it afterwards... I don't know whether this is relevant to what I am about to describe.

Regarding the VMs on the nodes:
00-lazarus runs 2x Debian 12, while 01-ares runs 1x Debian 12 plus Home Assistant. While trying to debug the problem I duplicated the two VMs on 00-lazarus, so it now runs 4 VMs (each with 2 GB of RAM, so memory shouldn't be a problem). The CPU type set for the two Debian VMs on 00-lazarus is x86-64-v2, and I kept the same CPU type on 01-ares.
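For reference, the CPU type can be checked and changed from the host shell with something like this (just a rough sketch; VM IDs 100 and 101 are the ones that appear in my log below):

Bash:
# Show the CPU type (and the rest of the config) for each VM
qm config 100 | grep -i cpu
qm config 101 | grep -i cpu

# Set the CPU type explicitly, if needed
qm set 100 --cpu x86-64-v2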

As the title says, I see a strange behavior that only happens sometimes when restarting 00-lazarus (and more rarely, only once so far, when shutting the node down and powering it back up). Proxmox correctly executes the 'bulk shutdown of all VMs and containers' task, reboots itself, and then correctly executes the 'bulk startup of all VMs and containers' task; at least, that is what the task logs say.
What actually happens is that, randomly, one of the two VMs hangs on 'Loading Initial ramdisk'. From there, simply resetting the VM is enough and it then works as expected.

I then moved the two VMs to 01-ares by backing them up to a NAS via NFS and restoring them there, and this problem has never appeared on 01-ares... I also duplicated the two VMs by restoring the backups on 00-lazarus, and the problem extended to the "new" VMs on 00-lazarus as well. So far it is always exactly one VM, chosen in a completely random manner, that gets stuck at boot as described above.
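(The move itself was nothing fancy, roughly the standard backup/restore flow sketched below; DS920p is the NFS storage you can see in the log, while the dump filename and VM IDs are just placeholders.)

Bash:
# On 00-lazarus: back up the VM to the NFS storage
vzdump 100 --storage DS920p --mode stop --compress zstd

# On 01-ares: restore the backup into a new VM ID
qmrestore /mnt/pve/DS920p/dump/vzdump-qemu-100-<timestamp>.vma.zst 110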

Unfortunately it is extremely frequent that one of the VMs on 00-lazarus fails to boot, but it is not systematic: sometimes they switch, and the previously faulty one works while the other gets stuck in that boot phase.

I did almost 150 reboots and shutdown/power-on cycles on both nodes, both back to back (10-15 reboots and 5-6 shutdowns one after the other) and spread out over a few days to let the VMs run for a while. 01-ares doesn't seem to suffer from this problem at all, while 00-lazarus starts all VMs correctly at boot only about once in 30 trials.

I checked journalctl on 00-lazarus for further details, but only lines reporting successful task execution appear.

Anyway, in case it's useful, here is a small snippet from journalctl after a reboot:
Bash:
Dec 02 09:51:50 00-lazarus systemd[1]: Starting pve-guests.service - PVE guests...
Dec 02 09:51:51 00-lazarus pve-guests[1075]: <root@pam> starting task UPID:00-lazarus:00000436:0000061F:674D7527:startall::root@pam:
Dec 02 09:51:51 00-lazarus pvesh[1075]: Starting VM 100
Dec 02 09:51:51 00-lazarus pve-guests[1078]: <root@pam> starting task UPID:00-lazarus:00000437:00000621:674D7527:qmstart:100:root@pam:
Dec 02 09:51:51 00-lazarus pve-guests[1079]: start VM 100: UPID:00-lazarus:00000437:00000621:674D7527:qmstart:100:root@pam:
Dec 02 09:51:52 00-lazarus systemd[1]: Created slice qemu.slice - Slice /qemu.
Dec 02 09:51:52 00-lazarus systemd[1]: Started 100.scope.
Dec 02 09:51:52 00-lazarus kernel: tap100i0: entered promiscuous mode
Dec 02 09:51:52 00-lazarus kernel: netbr: port 2(tap100i0) entered blocking state
Dec 02 09:51:52 00-lazarus kernel: netbr: port 2(tap100i0) entered disabled state
Dec 02 09:51:52 00-lazarus kernel: tap100i0: entered allmulticast mode
Dec 02 09:51:52 00-lazarus kernel: netbr: port 2(tap100i0) entered blocking state
Dec 02 09:51:52 00-lazarus kernel: netbr: port 2(tap100i0) entered forwarding state
Dec 02 09:51:53 00-lazarus pvesh[1075]: Waiting for 10 seconds (startup delay)
Dec 02 09:51:54 00-lazarus chronyd[848]: Selected source 85.199.214.99 (2.debian.pool.ntp.org)
Dec 02 09:51:54 00-lazarus chronyd[848]: System clock TAI offset set to 37 seconds
Dec 02 09:51:58 00-lazarus kernel: netfs: FS-Cache loaded
Dec 02 09:51:58 00-lazarus kernel: NFS: Registering the id_resolver key type
Dec 02 09:51:58 00-lazarus kernel: Key type id_resolver registered
Dec 02 09:51:58 00-lazarus kernel: Key type id_legacy registered
Dec 02 09:51:58 00-lazarus nfsrahead[1181]: setting /mnt/pve/DS920p readahead to 128
Dec 02 09:52:03 00-lazarus pvesh[1075]: Starting VM 101
Dec 02 09:52:03 00-lazarus pve-guests[1078]: <root@pam> starting task UPID:00-lazarus:000004A6:00000AD2:674D7533:qmstart:101:root@pam:
Dec 02 09:52:03 00-lazarus pve-guests[1190]: start VM 101: UPID:00-lazarus:000004A6:00000AD2:674D7533:qmstart:101:root@pam:
Dec 02 09:52:03 00-lazarus systemd[1]: Started 101.scope.
Dec 02 09:52:04 00-lazarus kernel: tap101i0: entered promiscuous mode
Dec 02 09:52:04 00-lazarus kernel: vpnbr: port 2(tap101i0) entered blocking state
Dec 02 09:52:04 00-lazarus kernel: vpnbr: port 2(tap101i0) entered disabled state
Dec 02 09:52:04 00-lazarus kernel: tap101i0: entered allmulticast mode
Dec 02 09:52:04 00-lazarus kernel: vpnbr: port 2(tap101i0) entered blocking state
Dec 02 09:52:04 00-lazarus kernel: vpnbr: port 2(tap101i0) entered forwarding state
Dec 02 09:52:05 00-lazarus pvesh[1075]: Waiting for 20 seconds (startup delay)
Dec 02 09:52:25 00-lazarus pve-guests[1075]: <root@pam> end task UPID:00-lazarus:00000436:0000061F:674D7527:startall::root@pam: OK
Dec 02 09:52:25 00-lazarus systemd[1]: Finished pve-guests.service - PVE guests.

The main problem here is not even this strange bug itself, but the fact that I can't find any details or logs about it, because Proxmox says the VMs started and are running correctly (i.e. running qm status <VM-ID> reports status: running).
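If it helps, these are the host-side checks I can think of running against the hung VM and posting here (just a sketch of standard commands, VM ID 100 as an example):

Bash:
# More detail than plain 'qm status'
qm status 100 --verbose

# Interactive QEMU monitor for the VM; 'info status' and 'info registers'
# show what the guest CPU is actually doing
qm monitor 100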

I also tried checking journalctl inside the VMs, but there is no log at all of those faulty boots (I'm not an expert, but presumably systemd is far from being started at that point, so journald doesn't log anything).
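For the next failed boot I was thinking of attaching a serial console to the VM, so that GRUB/kernel output is visible even before systemd/journald start. A rough sketch of what I mean (VM ID 100 as an example; the guest side assumes a standard Debian GRUB setup):

Bash:
# On the host: give the VM a serial port and attach to it
qm set 100 --serial0 socket
qm terminal 100

# Inside the Debian guest: route console output to the serial port,
# e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200"
# then regenerate the GRUB config
update-grub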

I tried searching the internet and the forum, and asking ChatGPT, but it doesn't seem to be something other users have already reported.

I know you would probably say: well, just reinstall Proxmox on the machine.
My reply: of course I will, but first I would like to know whether I can extract some further info, data, logs, or anything else before erasing everything.

This is pretty concerning to me because the two machines are very similar and at the beginning 00-lazarus didn't seem to have this problem, so I would like to understand what is wrong with it.
I will buy a proper server with ECC RAM and server-grade components... one day... but for now I'm just a humble student who wants to run a small homelab to avoid paying hundreds for cloud space here, services there, and so on :rolleyes:.
 
