no migration if ceph network fails

silvered.dragon

Renowned Member
Nov 4, 2015
hi my friends,
I have two networks in a 3-node Proxmox 4.4 cluster.
The first, 192.168.25.0, is for Proxmox cluster communication (1 Gb/s) and the second, 10.0.2.0, is for the Ceph network (10 Gb/s).
If the 192.168.25.0 network of a node goes down, all VMs on that node migrate to the other nodes automatically. The problem comes when the 10.0.2.0 network of a node goes down: I cannot access the VMs on that node and they do not migrate to the other nodes (they are simply shown as "running", but obviously I cannot reach them because Ceph connectivity is broken on that node).
What am I doing wrong?
many thanks
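For reference, a split like this is usually expressed in /etc/pve/ceph.conf roughly as follows. This is only a sketch based on the subnets named in the post; the /24 prefixes are an assumption, so adjust them to your actual netmasks:

```ini
[global]
    # Ceph traffic (client I/O and OSD replication) on the 10 Gb/s network;
    # corosync/PVE cluster traffic stays on 192.168.25.0 and is configured
    # separately in corosync.conf, not here.
    public network  = 10.0.2.0/24
    cluster network = 10.0.2.0/24
```

The point of showing it is that Proxmox HA only watches the corosync network, so nothing in this file makes a Ceph-network failure visible to the HA stack.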
 
I also had this problem whilst I was testing Proxmox HA:
https://forum.proxmox.com/threads/3-node-ceph-cluster-no-failover-on-ceph-nic-unplug.34050/

I never found a solution. Are you using a switch for the Ceph network or directly connecting over a mesh? I was using an HP switch and all pings were fine. I went over to XenServer because of this problem, but now the 'powers that be' have sent me back to Proxmox and I'm about to start setting it all up again. I was only here on the forum to check for updates... saw your post and thought 'oh sh...' ;)

H
 
I have a mesh network with InfiniBand 40G cards, but I have tested even with a simple 1 Gb/s Ethernet switch with the same results. I think this is a fencing problem. Any solution here?? :(
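On the fencing point: Proxmox HA only fences a node when it loses corosync membership, so a NIC that carries only Ceph traffic can die without the HA stack ever noticing. A crude per-node workaround is a script that watches the Ceph network and deliberately takes corosync down when it fails, so the node gets fenced and its VMs are recovered elsewhere. This is only a sketch under assumptions: 10.0.2.2 is a made-up peer address, and stopping corosync to self-fence is my own idea, not an official Proxmox mechanism.

```shell
#!/bin/sh
# Hypothetical Ceph-network watchdog. Run periodically (cron/systemd timer)
# on each node; peer address and the fencing action are assumptions.

# check_ceph_net: succeed only if a peer on the Ceph network answers ping
check_ceph_net() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# fence_self: losing corosync membership is what actually triggers HA
# recovery, so we take it down on purpose when the Ceph network is dead.
fence_self() {
    echo "ceph network unreachable, stopping corosync to force fencing"
    # systemctl stop corosync   # uncomment on a real node
}

if check_ceph_net "${PEER:-10.0.2.2}"; then
    echo "ceph network ok"
else
    fence_self
fi
```

Obvious caveat: if the whole Ceph switch dies, every node would try to fence itself at once, so a real version would need a quorum-aware check.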
 
Damn! I was using a switch and thought I'd either need to use a mesh or get some decent NICs to make it work, but I see you've already done that, so that wasn't my problem. I've got 4 new Fujitsu servers on order and I'll have to set up Proxmox/Ceph again, so I'll report back when that happens (2 weeks at least, though).
The funny thing is that, following the DRBD exit, this is now the default/supported configuration for Proxmox HA. It's very odd that it seems to be only us who have had this problem; I couldn't find anyone else reporting it.

Please report back if you isolate the issue...

Tks/H
 
Maybe the only way is to use the same network for Ceph and PVE cluster communication, but I think that would be a bad choice for performance and other reasons.
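Rather than merging the two networks, corosync (in the 2.x versions shipped with PVE 4.x) can run a redundant ring over both, so cluster communication survives the loss of either NIC. This doesn't fix the Ceph-side fencing gap, but it avoids sharing one link. A rough /etc/pve/corosync.conf totem fragment, written from memory of corosync 2.x and to be checked against the PVE docs before use:

```ini
totem {
    version: 2
    rrp_mode: passive            # fall back to ring1 if ring0 fails
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.25.0   # PVE cluster network
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.2.0       # Ceph network as backup ring
    }
}
```

Each node entry then needs a ring0_addr and ring1_addr on the matching subnets.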
 
@udo : Tks, but I recall that I tried that, and of course it solved the problem of one NIC/Ethernet cable breaking, but surely that's not the way it's supposed to be? If a VM in an HA cluster loses access to its disk replication, it should either remain unaffected and keep running (since one replicated copy of the disk/OSD is on the same machine) or fail over to another pool member that has a replicated copy of the disk.
In my case neither happened: PVE either shut the VM down completely, or the VM continued to run (as a Linux OS does) without being able to access its disk. I deemed this a huge hole in an HA system.

I can't debug further since I've since dismantled the system; maybe this is somehow fixed in PVE 5.0. I'll see in a couple of weeks.

H
 
