Proxmox host won't boot after reboot

gk_emmo

Member
Oct 24, 2020
Hi!

I have a 5-node Ceph cluster. One of the nodes died today and has not booted since. It is a Dell R740 with 10 SSDs and 2 NVMe drives for journal and DB. It had worked for weeks. I updated packages and rebooted. I have 2 kernels in GRUB, 5.15.85-1 and 5.15.30-2. Neither of them boots, even in recovery mode. Boot hangs on a screen that I don't understand; I have attached a screenshot. It hangs there for about half a minute, then the machine reboots and the cycle starts over. In GRUB, only IOMMU is turned on, nothing else. The node ran with that setting for weeks too, so I don't believe it was the trigger.

Does anybody have a clue, or has anyone run into this issue before?

Another question: this node ran 1 VM, which was not HA-managed but was stored on Ceph. How can I migrate this VM to another working node? Migrate is not working.

Thank you in advance,

Gabor
 

Attachments

  • 222111.png
Looks like a hardware problem. Try to boot from USB, memtest86 or something. If that fails too, swap memory modules and CPUs.
 
I tried to rule this out: Dell's built-in tests ran fine, and iDRAC reports no issues with the boot SSDs, which are in a ZFS RAID1.

Now I am trying to open the system disks via GRUB rescue, but ls gives me an "incorrect dnode type" error.

Thanks for the tip, I will try memtest to be sure.