Good day all,
I am lab testing an HA cluster with Ceph on PVE 4.4.13.
- 3 machines, 2 disks per machine: 1 for system/boot and 1 for the Ceph cluster.
- 2 VMs running (Linux), which I migrate around during testing.
- My basic test environment: VM 1 (Manjaro Linux) runs a bash script on startup that writes a record with the current date/time to a MySQL db on VM 2 (Ubuntu 16.04) every 5 seconds. By selecting the last records in the db table I can see whether both machines are running, at what time they stopped, and how long they took to restart (a rough sketch of the script follows this list).
- Each machine has 2 x 1Gb NICs: one for PVE/PVECM and bridged for the VMs; the second NIC is dedicated to Ceph.
- Setup as per wiki - no problems - runs fine
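For reference, the heartbeat on VM 1 is roughly the script below. It's only a sketch: the db host, database, table, user and password shown here are placeholders, not my real names.

#!/bin/bash
# Heartbeat writer (sketch) - inserts the current date/time into the MySQL db on VM 2 every 5 seconds.
# The host, credentials, db and table names below are placeholders.
while true; do
    mysql -h 192.0.2.52 -u hbuser -p'hbpass' heartbeat \
        -e "INSERT INTO pulse (vm, ts) VALUES ('vm1', NOW());"
    sleep 5
done

To check, I select the tail of the table on VM 2 with something like:

mysql -u hbuser -p'hbpass' heartbeat -e "SELECT vm, ts FROM pulse ORDER BY ts DESC LIMIT 10;"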
BUT... when I unplug the Ceph NIC cable on the machine/node that is hosting either of the two running VMs, the VM does not migrate.
The VM on the unplugged machine either keeps running (I can ssh in) but loses access to its disk - even if one of the OSDs in the pool is the local disk on that same machine - or PVECM just shuts it down. In either scenario it does not migrate and restart. The VM does not fail over, so the system (2 VMs talking to each other, reading and writing to shared storage) fails. I have tried Ceph pool size/min_size of 2/1 and 3/1 (the commands I used are sketched just below).
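For the record, I set those with commands along these lines ("rbd" stands in for my actual pool name):

# Replica count and the minimum replicas required for I/O - pool name is a placeholder
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
# Verify
ceph osd pool get rbd size
ceph osd pool get rbd min_size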
I guess I could work around this by installing another NIC and bonding the two for Ceph (I will probably do that on an eventual production installation anyway, along with mirrored drives), but my questions are:
- Is this the expected behaviour? Surely not? One NIC failing on a node that is hosting a VM breaks that VM?
- Where are the logs that I should look at showing what PVECM is doing, or trying to do at this time?
Ceph logs report: osd.2 is down, blah blah blah.
Kernel log says: igb 0000:03:00.0 eth1: igb: eth1 NIC Link is Down
Daemon log says: heartbeat_check: no reply from 0x55a123fc5590 osd.2 since back 2017-04-07 09:16:32.291905 front 2017-04-07 09:16:32.291905 (cutoff 2017-04-07 09:17:26.496194)
But OFC I know all that - I'm holding the unplugged cable in my hand!
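In case it helps, this is roughly what I have been watching on the nodes while the cable is out (assuming the HA services are pve-ha-crm and pve-ha-lrm, as they appear to be on my install):

# HA manager's view of resources and their state
ha-manager status
# HA cluster/local resource manager logs around the unplug
journalctl -u pve-ha-crm -u pve-ha-lrm --since "10 minutes ago"
# Ceph's view of the cluster while the NIC is down
ceph -s
ceph osd tree

So far none of that shows the HA stack trying to relocate the VM, which is what I am trying to understand.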
Tks in advance
HP