[SOLVED] On node crash, OSD is down but stays "IN", and all VMs on all nodes remain in error and unusable.

MarwaneAch

New Member
Apr 10, 2024
Hello,

I work for multiple clients, and one of them asked us to build a Proxmox cluster to give them fault tolerance on a good, cost-efficient hypervisor.

It's the first time we've put a Proxmox cluster into a production environment for a client; we've only used single-node Proxmox before. The client wanted to use the servers' physical HDDs as storage, so we built a Ceph cluster.

When I test HA by failing a full node, I shut the server down via iDRAC or with a shutdown command in the shell. Quorum stays up and the OSDs on the powered-off node go down, but they stay IN (at least 2 OSDs stay in every time). In that state, the running VMs/CTs are migrated successfully but aren't running properly (the VM icon shows up, but the VM errors on start in the logs).

When I mark those OSDs out manually, everything recovers and comes back up on my two remaining nodes.
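For reference, marking them out by hand is just a couple of commands; a minimal sketch (the OSD IDs are placeholders for whichever ones stay IN on your cluster):

    # see which OSDs are down but still IN (an in OSD typically shows REWEIGHT 1, an out OSD shows 0)
    ceph osd tree
    # manually mark the stuck OSDs out, e.g. osd.21 and osd.22
    ceph osd out 21 22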

I've searched everywhere for information about that but I don't see anything that can help me.

Feel free to ask me for more information in order to help.

I have everything on the latest version available today.

Nodes: 3
mons and mgrs: 3
mon_osd_down_out_interval: 120 (how to check this is sketched below)
OSDs: 8 on nodes 1 and 3, 7 on node 2
The problem appears on whichever node I shut down (one at a time, to keep quorum)
Ceph public network on the management network
Ceph network on a dedicated VLAN
But I only have one SFP+ link configured right now, and all traffic goes through it.
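These values can be double-checked from any node with something like the following (a rough sketch; the config-database command assumes a reasonably recent Ceph release):

    # overall health plus mon/mgr/osd summary
    ceph -s
    # how long Ceph waits before automatically marking a down OSD out
    ceph config get mon mon_osd_down_out_interval
    # per-OSD up/down state and weights
    ceph osd tree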

I hope this explanation isn't too foggy.

Thank you!
 
But I only have one SFP+ link configured right now, and all traffic goes through it.
Is the cluster breaking apart after a node outage? Why haven't you followed the network requirements, with 4 dedicated network links and the corresponding configuration?

the VM icon shows up, but the VM errors on start in the logs
What is the actual error?
 
Is the cluster breaking apart after a node outage? Why haven't you followed the network requirements, with 4 dedicated network links and the corresponding configuration?
Actually, it seems the cluster doesn't really break apart, but it doesn't work properly. And yes, after only one node outage.
I plan to add another SFP+ link, but only at the end of the year; we've had an issue with a switch. We will then have two SFP+ links in a bond. It's the only way we can do it and respect the client's budget, and they know the deal.
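For what it's worth, the bond we're planning would look roughly like this in /etc/network/interfaces (interface names are placeholders, and 802.3ad/LACP assumes the switch supports it):

    auto bond0
    iface bond0 inet manual
        # the two SFP+ ports (names are examples, check yours with `ip link`)
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
    # bond0 would then carry the Ceph VLAN and bridge the same way the single link does today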

Find in my first attachment the status of all my PVEs. I only shut down PVE03, and this happens.
In the second attachment, you can see that even though PVE01 shows "?", I can still access its shell. => The running VMs on both PVE01 and PVE02 are answering pings, but I can't connect to them with noVNC, and I get errors when trying to change a setting on them.
In the third attachment, you'll find the error I was talking about. I think it's meaningless for resolving the issue, but still.
The fourth and fifth are the pvecm status and Ceph status.
And the sixth shows the OSD problem I'm trying to describe.

=> You can see that the last 2 OSDs of PVE03 are still "IN" even though all the others are out and all of them are DOWN. I think that's the problem, because when I mark them "OUT" manually, the two other nodes start working perfectly.

I tested this on PVE01 and got exactly the same problem with 2 OSDs. If I mark those OSDs OUT manually BEFORE shutting down the node, 2 other OSDs misbehave in the same way.

My Ceph configuration is in the last attachment. This is the config when all nodes are UP.

(We are currently using the no-subscription repository; we may change that later if possible.)

Thank you very much.
 

Attachments

  • pve state.png (25 KB)
  • still access to shell.png (61.8 KB)
  • Console error.png (8.7 KB)
  • pvecm status.png (28 KB)
  • ceph status.png (28.4 KB)
  • OSD state after 10 minutes.png (81.8 KB)
  • config.png (17 KB)
Hi, after talking with the Proxmox support team, I managed to resolve the issue.

I had set my Ceph pool to 2/2 replication. That caused the problem: the pool kept only 2 replicas (the first 2) and told Ceph that if fewer than 2 replicas are available, it's critical and I/O must stop (the second 2). So losing a node made it lock up.

I could have used 2/1 instead, but that is strongly discouraged because it can lead to data loss. 3/2 is the recommended setting for 3 nodes. I tested it and it instantly worked as it should.
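For anyone who hits the same thing, the pool replication settings can be checked and fixed along these lines (the pool name "vm-pool" is just an example; use your own):

    # show the current replication settings
    ceph osd pool get vm-pool size
    ceph osd pool get vm-pool min_size
    # switch to the recommended 3/2 for a 3-node cluster
    ceph osd pool set vm-pool size 3
    ceph osd pool set vm-pool min_size 2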

There might have been some logs, but I didn't understand them.

Thank you for your help anyway!