[SOLVED] On node crash, OSD is down but stays "IN", and all VMs on all nodes remain in error and unusable.

MarwaneAch

New Member
Apr 10, 2024
Hello,

I work for multiple clients, and one of them asked us to build a Proxmox cluster to give them fault tolerance on a good, cost-efficient hypervisor.

It's the first time we've put a Proxmox cluster into a production environment for a client; we've only used single-node Proxmox before. The client wanted to use the servers' physical HDDs as storage, so we built a Ceph cluster.

When I test HA by failing a full node, I shut the server down via iDRAC or with a shutdown command in the shell. Quorum stays up and the OSDs on the powered-off node go down, but they stay IN (at least 2 OSDs stay in every time). In that state, the running VMs/CTs are migrated successfully but aren't running properly (the VM icon shows up, but the VM errors on start in the logs).

When I mark those OSDs out manually, everything recovers and comes back up on my two remaining nodes.
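For reference, marking them out by hand is just a couple of commands; a minimal sketch (the OSD IDs are placeholders for whichever ones stay IN on your cluster):

    # see which OSDs are down but still IN (an in OSD typically shows REWEIGHT 1, an out OSD shows 0)
    ceph osd tree
    # manually mark the stuck OSDs out, e.g. osd.21 and osd.22
    ceph osd out 21 22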

I've searched everywhere for information about that but I don't see anything that can help me.

Feel free to ask me for more information in order to help.

I have everything on the latest version available today.

Nodes: 3
mons and mgrs: 3
mon_osd_down_out_interval: 120 (how to check this is sketched below)
OSDs: 8 on nodes 1 and 3, 7 on node 2
The problem appears on whichever node I shut down (one at a time, to keep quorum)
Ceph public network on the management network
Ceph network on a dedicated VLAN
But I only have one SFP+ link configured right now, and all traffic goes through it.
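These values can be double-checked from any node with something like the following (a rough sketch; the config-database command assumes a reasonably recent Ceph release):

    # overall health plus mon/mgr/osd summary
    ceph -s
    # how long Ceph waits before automatically marking a down OSD out
    ceph config get mon mon_osd_down_out_interval
    # per-OSD up/down state and weights
    ceph osd tree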

I hope this explanation isn't too foggy.

Thank you!
 
But I only have one SFP+ link configured right now, and all traffic goes through it.
Is the cluster breaking apart after a node outage? Why haven't you followed the network requirements, with 4 dedicated network links and the corresponding configuration?

the VM icon shows up, but the VM errors on start in the logs
What is the actual error?
 
Is the cluster breaking apart after a node outage? Why haven't you followed the network requirements, with 4 dedicated network links and the corresponding configuration?
Actually, it seems the cluster doesn't really break apart, but it doesn't work properly. And yes, after only one node outage.
I plan to add another SFP+ link, but only at the end of the year; we've had an issue with a switch. We will then have two SFP+ links in a bond. It's the only way we can do it and respect the client's budget, and they know the deal.
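For what it's worth, the bond we're planning would look roughly like this in /etc/network/interfaces (interface names are placeholders, and 802.3ad/LACP assumes the switch supports it):

    auto bond0
    iface bond0 inet manual
        # the two SFP+ ports (names are examples, check yours with `ip link`)
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
    # bond0 would then carry the Ceph VLAN and bridge the same way the single link does today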

Find in my first attachment the status of all my PVEs. I only shut down PVE03, and this happens.
In the second attachment, you can see that even though PVE01 shows "?", I can still access its shell. => The running VMs on both PVE01 and PVE02 are answering pings, but I can't connect to them with noVNC, and I get errors when trying to change a setting on them.
In the third attachment, you'll find the error I was talking about. I think it's meaningless for resolving the issue, but still.
The fourth and fifth are the pvecm status and Ceph status.
And the sixth shows the OSD problem I'm trying to describe.

=> You can see that the last 2 OSDs of PVE03 are still "IN" even though all the others are out and all of them are DOWN. I think that's the problem, because when I mark them "OUT" manually, the two other nodes start working perfectly.

I tested this on PVE01 and got exactly the same problem with 2 OSDs. If I mark those OSDs OUT manually BEFORE shutting down the node, 2 other OSDs misbehave in the same way.

My Ceph configuration is in the last attachment. This is the config when all nodes are UP.

(We are currently using the no-subscription repository; we may change that later if possible.)

Thank you very much.
 

Attachments

  • pve state.png (25 KB)
  • still access to shell.png (61.8 KB)
  • Console error.png (8.7 KB)
  • pvecm status.png (28 KB)
  • ceph status.png (28.4 KB)
  • OSD state after 10 minutes.png (81.8 KB)
  • config.png (17 KB)
Hi, after talking with the Proxmox support team, I managed to resolve the issue.

I had set my Ceph pool to 2/2 replication. That caused the problem: the pool kept only 2 replicas (the first 2) and told Ceph that if fewer than 2 replicas are available, it's critical and I/O must stop (the second 2). So losing a node made it lock up.

I could have used 2/1 instead, but that is strongly discouraged because it can lead to data loss. 3/2 is the recommended setting for 3 nodes. I tested it and it instantly worked as it should.
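For anyone who hits the same thing, the pool replication settings can be checked and fixed along these lines (the pool name "vm-pool" is just an example; use your own):

    # show the current replication settings
    ceph osd pool get vm-pool size
    ceph osd pool get vm-pool min_size
    # switch to the recommended 3/2 for a 3-node cluster
    ceph osd pool set vm-pool size 3
    ceph osd pool set vm-pool min_size 2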

There might have been some logs, but I didn't understand them.

Thank you for your help anyway!