Proxmox host won't boot after reboot

gk_emmo

Oct 24, 2020
Hi!

I have a 5-node Ceph cluster. One of the nodes died today and hasn't booted since. It is a Dell R740 with 10 SSDs and 2 NVMe drives for the journal and DB. It had been working for weeks. I updated the packages and rebooted. I have 2 kernels in GRUB, 5.15.85-1 and 5.15.30-2. Neither of them boots, not even in recovery mode. Boot hangs on a screen I don't understand (screenshot attached). It stays there for about half a minute, then the server reboots and starts over. The only kernel option I set in GRUB is IOMMU, nothing else. It had also been working with that option for weeks, so I don't believe it was the trigger.
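For reference, a GRUB-level IOMMU setting like the one mentioned usually lives in the `GRUB_CMDLINE_LINUX_DEFAULT` line of `/etc/default/grub` (on Intel Xeons such as the R740's, typically `intel_iommu=on`). The actual line from this node isn't in the post, so the sample values below are hypothetical. This is a minimal sketch, assuming the disk has been mounted from a rescue system, that lists which kernel parameters are configured there. It does not expand shell variables and does not read snippets under `/etc/default/grub.d/`.

```python
from __future__ import annotations

import shlex
from pathlib import Path
from typing import Optional

KEY = "GRUB_CMDLINE_LINUX_DEFAULT"


def parse_cmdline_assignment(line: str) -> dict[str, Optional[str]]:
    """Turn 'GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"' into
    {'quiet': None, 'intel_iommu': 'on'}."""
    key, sep, raw = line.strip().partition("=")
    if not sep or key != KEY:
        raise ValueError(f"not a {KEY} assignment: {line!r}")
    # shlex removes the shell quoting; the unquoted text is then split on whitespace
    words = " ".join(shlex.split(raw)).split()
    params: dict[str, Optional[str]] = {}
    for word in words:
        name, eq, value = word.partition("=")
        params[name] = value if eq else None  # bare flags such as 'quiet' map to None
    return params


def read_default_cmdline(path: Path = Path("/etc/default/grub")) -> dict[str, Optional[str]]:
    """Parameters from the last uncommented assignment (later lines win, as in sh)."""
    params: dict[str, Optional[str]] = {}
    for line in path.read_text().splitlines():
        # commented-out lines start with '#' and therefore never match
        if line.strip().startswith(KEY + "="):
            params = parse_cmdline_assignment(line)
    return params
```

Pointed at the mounted root (for example `read_default_cmdline(Path("/mnt/etc/default/grub"))`, where `/mnt` is an assumed mount point), this shows at a glance whether anything besides the IOMMU option is set.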

Does anybody have a clue, or has anyone run into this issue before?

Another question: this node ran one VM, which was not HA-managed but had its disks on Ceph. How can I move this VM to another working node? Migrate is not working.
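A commonly used approach when a node is dead and the VM's disks are on shared storage is to move the VM's config file inside the cluster filesystem, from `/etc/pve/nodes/<dead node>/qemu-server/<vmid>.conf` to the same directory of a live node; on the command line that is a single `mv`. This is a minimal sketch under stated assumptions: the cluster is still quorate (4 of 5 nodes up), the failed node stays powered off until repaired (otherwise the VM could run twice on the same Ceph images), and it runs as root on a surviving node. The node names `pve3`/`pve1`, VMID `100` and storage ID `ceph-vm` used in the usage note are hypothetical, as is the `shared_storages` parameter, which the caller fills with the storage IDs known to be shared.

```python
from __future__ import annotations

import os
import re
from pathlib import Path

PVE_ROOT = Path("/etc/pve")  # pmxcfs mount, replicated to all cluster nodes

# Disk keys in a QEMU VM config, e.g. "scsi0: <storage>:vm-100-disk-0,size=32G"
DISK_LINE = re.compile(r"^(?:ide|sata|scsi|virtio|efidisk|tpmstate|unused)\d+:\s*([^:,\s]+):")


def vm_config_path(node: str, vmid: int, root: Path = PVE_ROOT) -> Path:
    return root / "nodes" / node / "qemu-server" / f"{vmid}.conf"


def disk_storages(config_text: str) -> set[str]:
    """Storage IDs referenced by the VM's disk lines (a CD-ROM set to 'none' is ignored)."""
    return {m.group(1) for line in config_text.splitlines() if (m := DISK_LINE.match(line))}


def reassign_vm(vmid: int, dead_node: str, target_node: str,
                shared_storages: set[str], root: Path = PVE_ROOT) -> Path:
    src = vm_config_path(dead_node, vmid, root)
    dst = vm_config_path(target_node, vmid, root)
    if not src.is_file():
        raise FileNotFoundError(src)
    if dst.exists():
        raise FileExistsError(dst)
    local = disk_storages(src.read_text()) - shared_storages
    if local:
        # disks on the dead node's local storage cannot follow the config
        raise RuntimeError(f"disks on non-shared storage: {sorted(local)}")
    os.rename(src, dst)  # a rename like `mv`, so the config exists on exactly one node
    return dst
```

Usage would look like `reassign_vm(100, "pve3", "pve1", {"ceph-vm"})`; afterwards the VM should appear under `pve1` and can be started there. Refusing to move configs that reference non-shared storage is a deliberate choice: such a VM would not start on the new node anyway.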

Thank you in advance

Gabor
 

Attachments

  • 222111.png
Looks like a hardware problem. Try booting memtest86 or a similar tool from USB. If that fails too, swap memory modules and CPUs.
 
I tried to rule this out: Dell's built-in diagnostics ran fine, and iDRAC reports no issues with the boot SSDs, which are in a ZFS RAID1.

Now I'm trying to access the system disks from the GRUB rescue shell, but ls gives me an "incorrect dnode type" error.
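iDRAC only sees the SSDs as hardware; it knows nothing about the ZFS mirror on top of them. To check the pool itself, one could boot a live system that includes the ZFS tools, import the pool without mounting it (`zpool import -N rpool`, adding `-f` if it refuses because the pool was last used by another system) and look at `zpool status`. The pool name `rpool` is the Proxmox installer's default and is an assumption here. This is a minimal sketch that parses the device table of that output and lists anything not ONLINE; it follows the usual `zpool status` layout and may need adjusting for other ZFS versions. It does not explain GRUB's error, it only shows whether ZFS itself considers both halves of the mirror healthy.

```python
from __future__ import annotations

import subprocess

# Health states a pool, vdev or device can report in `zpool status`
ZFS_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def zpool_status(pool: str = "rpool") -> str:
    # Requires the pool to be imported on the rescue system first
    return subprocess.run(["zpool", "status", pool], capture_output=True,
                          text=True, check=True).stdout


def vdev_states(status_text: str) -> dict[str, str]:
    """Map each name in the 'config:' table to its STATE column."""
    states: dict[str, str] = {}
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            break  # the device table ends here
        if not in_config or not stripped or stripped.startswith("NAME"):
            continue  # skip the header lines, blank lines and the column titles
        fields = stripped.split()
        if len(fields) >= 2 and fields[1] in ZFS_STATES:
            states[fields[0]] = fields[1]
    return states


def unhealthy(status_text: str) -> list[str]:
    """Names whose state is anything other than ONLINE, in table order."""
    return [name for name, state in vdev_states(status_text).items() if state != "ONLINE"]
```

On the rescue system this would be called as `unhealthy(zpool_status())`; an empty list means ZFS reports every device of the mirror as ONLINE.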

Thanks for the tip, I will run memtest to be sure.