7 nodes, one down, two unknown

LumbermanSVO

Apr 25, 2021
I have a 7-node Proxmox/Ceph cluster that has been running fine for several months.

This morning I woke up and one node (Chen1) was showing as down, with a red X in the node list, but it was still powered up. I couldn't log into the web UI through it, I couldn't SSH into it, and it wouldn't respond to pings. I rebooted it and it came back up and seemed to work fine for about 5 minutes, then went down again.

After rebooting that node, two others (Dak1, Dak2) started showing “unknown” as their status. On those I can load the web UI, I can SSH in, and they DO respond to pings. The LXCs assigned to those nodes were running, but also showed “unknown” as their state. The HA state for those LXCs was "error".

I had to go to work, so I powered down the whole cluster for the day. Where do I start with troubleshooting this mess when I get home this afternoon?

Edit: All nodes are running version 6 and were last updated sometime last week.
 
Hello,

Have you checked the syslog under /var/log/syslog.*, or journalctl?

What does the output of pvecm status say?
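
A rough sketch of how those checks could be gathered on one of the affected nodes. The helper names `run_if_present` and `cluster_diagnostics` are made up for illustration, and the choice of units (corosync, pve-cluster), the one-hour window, and the grep pattern are assumptions about what is relevant here:

```shell
# Hypothetical helper: run a command only if it exists on this host.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found"
    fi
}

# Hypothetical wrapper collecting the checks asked about above.
# Nothing runs until the function is called.
cluster_diagnostics() {
    # Cluster membership and quorum as seen from this node.
    run_if_present pvecm status

    # Recent messages from the cluster stack (assumed units of interest).
    run_if_present journalctl -u corosync -u pve-cluster --since "1 hour ago" --no-pager

    # Current syslog, filtered for cluster and Ceph related lines.
    if [ -r /var/log/syslog ]; then
        grep -iE 'corosync|pmxcfs|ceph' /var/log/syslog | tail -n 50
    fi
}
```

Running `cluster_diagnostics` on a healthy node and on one showing "unknown", then comparing the two outputs, may help narrow down what differs.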
 
So, last night when I got home, the first thing I did was boot just the one "Bad" node and make a dd copy of the boot drive, just in case the drive was failing. Next I shut that node off, then booted the whole cluster at the same time. Everything came up just fine, then after about 10 minutes the "Bad" node dropped offline again. I then plugged in a monitor and keyboard and there was an error on the screen: it couldn't mount the Ceph volume, and it gave instructions for checking the status. So I rebooted the node and ran
Code:
systemctl status mnt-pve-cephfs.mount
as instructed in the error, but nothing looked off.
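
For reference, a sketch of a few more checks that could follow when the mount unit itself looks clean. It assumes the Ceph CLI is installed on the node; `have` and `cephfs_checks` are names invented for this example, not anything Proxmox ships:

```shell
# Hypothetical helper: succeeds if the named command is available.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical wrapper; nothing runs until it is called.
cephfs_checks() {
    # Log lines for the mount unit from the current boot.
    have journalctl && journalctl -b -u mnt-pve-cephfs.mount --no-pager

    if have ceph; then
        ceph -s              # overall cluster status
        ceph health detail   # explanation of any warnings or errors
        ceph mon stat        # which monitors are currently in quorum
    else
        echo "ceph CLI not found on this node"
    fi
}
```

Calling `cephfs_checks` shortly after boot, before the node drops off again, would show whether the cluster was already unhealthy from that node's point of view.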

Because the error had to do with mounting the Ceph volume, I decided to add a Ceph monitor to that node, thinking it might help the node connect if things are a bit off. I dunno, it seemed kinda logical. Either way, it has been 24 hours and everything is fine. The two nodes that were "Unknown" have been fine since rebooting the cluster.

Interestingly, the Home Assistant VM would not boot after all this. I just restored a backup from a couple days ago and it is all good now.
 
