Nodes inexplicably shut down in Ceph cluster on Proxmox VE 8.1.4

Jalvarez

New Member
Aug 26, 2023
The nodes in my Ceph cluster on Proxmox VE 8.1.4 inexplicably shut down. This happens occasionally when a backup is made of a VM whose storage is in a Ceph pool: suddenly the node freezes and I have to restart it to make it work again.
I have 3 identical nodes.

Could someone please help me with this problem?
 
Do you use HA?
How many corosync links do you have (corosync-cfgtool -s on each node)?
Which network is used for Ceph public and cluster traffic?
"The nodes inexplicably turn off" means they reboot or they get fully off?

Please describe the exact behavior that you are seeing.
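For reference, here is a quick way to collect that information on each node (a sketch assuming a standard Proxmox VE 8.x install with Ceph set up through pveceph):

# Show the corosync links and whether each one is connected
corosync-cfgtool -s

# Show which subnets carry Ceph public and cluster traffic
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf

# Show whether HA is active and which resources it manages
ha-manager status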
 
It's a disaster: everything freezes and cannot be used until the reset button is pressed.
 
This error appears when I migrate the virtual hard drives to the storage created in a Ceph pool.
 
There's little anyone can do to help you unless you provide at least the data I requested:

Do you use HA?
How many corosync links do you have (corosync-cfgtool -s on each node)?
Which network is used for Ceph public and cluster traffic?
"The nodes inexplicably turn off" means they reboot or they get fully off?
 
Ceph is very demanding on the network, with massive traffic multiplication, and from what I understand corosync is very sensitive to network blips. What is the networking on the nodes? 1 Gbit? 10 Gbit? 100 Gbit? How many NICs per node? It does sound like Ceph could be saturating the network bandwidth and causing corosync to fail.
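If you are unsure, something like this shows the negotiated link speed and traffic counters per NIC (the interface name is an assumption, substitute your own):

# Negotiated link speed of a NIC (repeat per interface)
ethtool eno1 | grep -i speed

# Per-interface RX/TX byte and error counters, to spot a saturated link
ip -s link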
 
I have three network cards: one connection for the cluster network, another for the administrative network, and the other for the VMs at the bridge level. The networks are:

ADMIN VLAN: 172.16.20.0/26
CLUSTER: 192.168.60.0/28
VIRTUAL VM: 172.16.20.0/26
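Since corosync has no redundant link here, I would give it a second link on the admin NIC so it can survive saturation of the cluster network. A minimal sketch of the relevant parts of /etc/pve/corosync.conf, with addresses assumed from this thread (remember to increment config_version whenever you edit this file, and repeat the node entry for all three nodes):

# /etc/pve/corosync.conf (excerpt)
nodelist {
  node {
    name: srvprox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.60.10   # dedicated cluster network (assumed IP)
    ring1_addr: 172.16.20.10    # admin network as fallback link
  }
  # ... node entries for the other two nodes ...
}

totem {
  # one interface block per link
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}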
 
OK, I have already done it; I am monitoring to see how the cluster behaves with this change in corosync.
 
The same problems continue to occur; however, I have noticed that when I group more than 5 VMs that have Ceph storage, the node hangs.
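One way to test whether backup traffic is the trigger is to cap the bandwidth vzdump may use, so a large job cannot saturate the link. A sketch of the node-wide default setting (the value here is only an illustration):

# /etc/vzdump.conf – node-wide backup defaults
# bwlimit is in KiB/s; 102400 KiB/s is roughly 100 MiB/s
bwlimit: 102400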
 
How can I perform migrations within the cluster? It gives me this error:

root@172.16.20.10: Permission denied (publickey,password).
TASK ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=srvprox1' root@172.16.20.10 pvecm mtunnel -migration_network 192.168.60.10/28 -get_migration_ip' failed: exit code 255
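That "Permission denied (publickey,password)" usually means the inter-node root SSH trust is broken. The usual repair, sketched here with the IP taken from the error above:

# Regenerate the cluster SSH keys/certificates shared via /etc/pve (run on each node)
pvecm updatecerts

# Then verify the passwordless root SSH that the migration tunnel relies on
/usr/bin/ssh -o BatchMode=yes root@172.16.20.10 /bin/true && echo OK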
 
How can I perform migrations within the cluster? It gives me this error:

root@172.16.20.10: Permission denied (publickey,password).
TASK ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=srvprox1' root@172.16.20.10 pvecm mtunnel -migration_network 192.168.60.10/28 -get_migration_ip' failed: exit code 255
I saw similar issues on my fresh 8.1 install and could not find any resolution to this problem; 3 of my 4 servers are still down after this shit-show. Did you find a solution?
 
