On node crash, OSDs go down but stay "IN", and all VMs on all nodes remain in error and unusable.

MarwaneAch

New Member
Apr 10, 2024
Hello,

I work for multiple clients, and one of them asked us to build a Proxmox cluster to give them fault tolerance on a cost-efficient hypervisor.

It's the first time we've put a Proxmox cluster into a production environment for a client; until now we had only used single-node Proxmox. The client wanted to use the servers' physical HDDs as storage, so we built a Ceph cluster.

To test HA against the failure of a full node, I shut a server down via iDRAC or with the shutdown command in the shell. Quorum stays up, and the OSDs on the powered-off node go down, but they stay IN (at least 2 OSDs stay in every time). In that state, the running VMs/CTs are migrated successfully but don't run properly (the VM icon shows it as up, but the logs show an error on VM start).

When I mark the OSDs out manually, everything recovers and comes back up on my two remaining nodes.
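For reference, the manual workaround is just the standard Ceph CLI; the OSD IDs below are only placeholders for whichever OSDs stay IN:

    # see which OSDs are down but still "in" after the node is powered off
    ceph osd tree

    # mark the stuck OSDs out by hand (placeholder IDs)
    ceph osd out 21 22

    # check that the cluster starts rebalancing onto the two remaining nodes
    ceph -s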

I've searched everywhere for information about this, but I haven't found anything that helps.

Feel free to ask me for more information in order to help.

I have everything on the latest version available today.

Nodes: 3
MONs and MGRs: 3
mon_osd_down_out_interval: 120 (how I check this is shown right after the list)
OSDs: 8 on nodes 1 and 3, 7 on node 2
The problem appears on whichever node I shut down (I shut them down one at a time, to keep quorum).
Ceph public network on the management network
Ceph cluster network on a dedicated VLAN
But I only have one SFP+ link configured right now, and all the traffic goes through it.
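For completeness, this is where and how I check those values (standard Proxmox and Ceph locations and commands, nothing custom):

    # Proxmox keeps the cluster-wide Ceph config here
    # (public_network and cluster_network are defined in this file)
    cat /etc/pve/ceph.conf

    # the down -> out interval as the monitors currently see it
    ceph config get mon mon_osd_down_out_interval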

I hope this explanation isn't too foggy.

Thank you !
 
But I only have one SFP+ link configured right now, and all the traffic goes through it.
Is the cluster breaking apart after a node outage? Why haven't you adhered to the network requirements, i.e. four dedicated network links and the corresponding configuration?

the VM icon shows it as up, but the logs show an error on VM start
What is the actual error?
 
Is the cluster breaking apart after a node outage? Why haven't you adhered to the network requirements, i.e. four dedicated network links and the corresponding configuration?
Actually, it seems the cluster doesn't really break apart, but it doesn't work properly either. And yes, after only one node outage.
I plan to add another SFP+, but only at the end of the year, because we've had an issue with a switch; we will then have two SFP+ links with bonding. It's the only way we can do it and stay within the client's budget, and they know the deal.
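Nothing fancy is planned for the bond, just the usual LACP setup from the Proxmox admin guide. A rough sketch of the relevant /etc/network/interfaces part, with placeholder NIC names and addresses:

    # sketch only - NIC names and addresses are placeholders
    auto bond0
    iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4

    auto vmbr0
    iface vmbr0 inet static
        address 192.0.2.11/24
        gateway 192.0.2.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0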

In my first attachment you can find the status of all my PVEs. I only shut down PVE03, and this is what happens.
In the second attachment you can see that even though PVE01 shows a "?", I can still access its shell. => The running VMs on both PVE01 and PVE02 still answer pings, but I can't connect to them with noVNC, and I get errors when trying to change any setting on them.
In the third attachment you'll find the error I was talking about. I don't think it's meaningful for resolving the issue, but still.
The fourth and fifth are the pve and ceph status.
And the sixth shows the OSD problem I'm trying to describe.

=> You can see that the last 2 OSDs of PVE03 are still "IN" even though all the others are OUT, and all of them are DOWN. I think this is the problem, because when I put those two manually "OUT", the two other nodes start working perfectly.

I tested this on PVE01 and got exactly the same problem with 2 OSDs. If I put those OSDs OUT manually BEFORE shutting down the node, it's 2 other OSDs that get stuck in the same way.
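Concretely, that test was nothing more than the following, with placeholder IDs standing in for that node's OSDs:

    # mark the node's OSDs out by hand BEFORE shutting it down
    ceph osd out 0 1 2 3 4 5 6 7

    # then power the node off and watch which other OSDs end up stuck "in"
    watch -n 5 ceph osd tree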

My Ceph configuration is in the last attachment; it was taken while all nodes were UP.

(We are currently using the no-subscription repository; we may change that later if possible.)

Thank you very much.
 

Attachments

  • pve state.png (25 KB)
  • still access to shell.png (61.8 KB)
  • Console error.png (8.7 KB)
  • pvecm status.png (28 KB)
  • ceph status.png (28.4 KB)
  • OSD state after 10 minutes.png (81.8 KB)
  • config.png (17 KB)
