no migration if ceph network fails

silvered.dragon

hi my friends,
I have two networks in a 3-node Proxmox 4.4 cluster.
The first one, 192.168.25.0, is for Proxmox cluster communication (1Gb/s) and the second one, 10.0.2.0, is for the Ceph network (10Gb/s).
If the 192.168.25.0 network of a node goes down, all VMs on that node migrate to the other nodes automatically. The problem comes if the 10.0.2.0 network of a node goes down: I cannot access the VMs on that node and they will not migrate to the other nodes (they are simply shown as "running", but obviously I cannot reach them because Ceph connectivity is broken on that node).
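For reference, this is roughly the split in my ceph.conf (the subnets are from my setup, adjust to yours):

    [global]
        # all Ceph traffic (client I/O and OSD replication) on the 10Gb/s network
        public_network = 10.0.2.0/24
        cluster_network = 10.0.2.0/24
        # corosync/PVE cluster traffic runs on 192.168.25.0/24 and is
        # configured in corosync.conf, not here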
What am I doing wrong?
many thanks
 
I also had this problem whilst I was testing Proxmox HA:
https://forum.proxmox.com/threads/3-node-ceph-cluster-no-failover-on-ceph-nic-unplug.34050/

I never found a solution. Are you using a switch for the Ceph network, or directly connected over a mesh? I was using an HP switch, all pings were fine. I went over to Xenserver because of this problem but now the 'powers that be' have sent me back to Proxmox and I'm about to start setting it all up again. I was only here on the forum to check for updates...saw your post and thought 'oh sh...' ;)

H
 
I have a mesh network with InfiniBand 40G cards, but I have tested even with a simple 1Gb/s Ethernet switch, with the same results. I think this is a fencing problem. Any solution here?? :(
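As far as I understand it, PVE HA only fences a node when it loses corosync quorum, and quorum here runs over the 192.168.25.0 network, so pulling the Ceph NIC never triggers fencing. A rough way to see this from the affected node (standard PVE/Ceph commands):

    pvecm status        # node is still quorate over 192.168.25.0
    ha-manager status   # HA resources still reported as started
    ceph -s             # hangs or shows this node's OSDs down on 10.0.2.0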
 
Damn! I was using a switch and thought I'd either need to use mesh, or get some decent NICs to get it to work, but I see you've already done that, so that wasn't my problem. I've got 4 new Fujitsu servers on order and I'll have to set up Proxmox/Ceph again so I'll report back when that happens (2 weeks at least though).
The funny thing is that, following the DRBD exit, this is now the default/supported configuration for Proxmox HA. It's very odd that we seem to be the only ones who have had this problem; I couldn't find anyone else reporting it.

Plsssss report back if you isolate the issue...

Tks/H
 
Maybe the only way is to use the same network for Ceph and PVE cluster communication, but I think that would be a bad choice for performance and other reasons.
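Or, instead of merging the networks, make the Ceph link itself redundant with an active-backup bond; a rough /etc/network/interfaces sketch (interface names are just examples):

    auto bond0
    iface bond0 inet static
        address 10.0.2.1
        netmask 255.255.255.0
        bond-slaves enp3s0 enp4s0
        bond-miimon 100
        bond-mode active-backup
        # Ceph stays on its own 10.0.2.0/24 subnet; PVE cluster
        # traffic remains on 192.168.25.0/24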
 
@udo : Tks, but I recall that I tried that, and of course it solved the problem of one NIC/Ethernet cable breaking, but surely that's not the way it's supposed to be? If a VM in an HA cluster loses access to its disk replication, it should either remain unaffected and keep running (since one replicated copy of the disk/OSD is on the same machine) or it should fail over to another pool member that has a replicated copy of the disk?
In my case neither happened. PVE either shut the VM down completely, or the VM continued to run (as a Linux OS does) without being able to access its disk. I deemed this a huge hole in an HA system.
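The only workaround I could think of is an external check; a purely hypothetical sketch (the VM ID, target node and timeout are made up) that relocates a VM when Ceph stops answering from the local node:

    #!/bin/bash
    # hypothetical watchdog: if ceph -s doesn't answer within 10s from this
    # node, ask the HA manager to stop the VM and restart it elsewhere
    if ! timeout 10 ceph -s >/dev/null 2>&1; then
        ha-manager relocate vm:100 pve2
    fi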

I can't debug further because I've since dismantled the system; maybe this is somehow fixed in PVE 5.0. I'll see in a couple of weeks.

H