Search results

  1. HA Design

    I had this scenario happen during testing. If the network cables are removed from the interface on the 10.201.0.0 subnet on any given node, and corosync is running on a separate network, then the VMs do not migrate. The scenario should be extremely rare, as the 10.201.0.0 network is a bonded...
  2. HA Design

    I currently have a 4 node HCI cluster that's working quite well. It will be expanding to 8 nodes total and be used for critical services. All of the testing was satisfactory and management was duly impressed. I am reinstalling the cluster from scratch in order to ensure none of the testing bits...
  3. [SOLVED] Ceph Slow Ops Reported

    The slow ops and crashes went completely away when I unmounted the no-longer-existing CephFS mount. I don't know enough about Ceph to comment on why it happened that way. I didn't unmount the filesystem prior to destroying CephFS, so I'm blaming it on there being a nut loose behind the keyboard...
  4. VM Not Migrating in HA

    So if I need the VMs to migrate when a specific interface is down then corosync should run on that interface?
  5. [SOLVED] Ceph Slow Ops Reported

    CephFS was still mounted by the OS; once it was unmounted, the errors stopped. I can't explain that other than it being the only thing that changed in the environment. Yes, I am aware the 169.254 address range is reserved for link-local in the RFC. However, none of these machines have any...
  6. VM Not Migrating in HA

    So I ran into a weird situation today. Something went haywire on the switch stack we are using for bonding, taking out one node in a four-node HA cluster. The node shows links as up, but traffic goes nowhere. The three other nodes are fine. We are using a separate network for corosync, so as far as...
  7. [SOLVED] Ceph Slow Ops Reported

    Sorry I didn't respond sooner. The errors are no longer happening, it appeared to be the result of my creating, then destroying CephFS. Even 'df' was hanging. In the end it was my own mistake. Once corrected everything is up, running, and stable. We are testing all of the aspects of HA and...
  8. [SOLVED] Ceph Slow Ops Reported

    New error, or one I hadn't seen before: Sep 28 14:58:30 hhsc0002 ceph-crash[1910]: WARNING:__main__:post /var/lib/ceph/crash/2020-09-28_19:41:51.826711Z_f86ab357-b41f-4cb4-b1ae-04eab176e905 as client.crash.hhsc0002 failed: [errno 2] error connecting to the cluster Sep 28 14:58:30 hhsc0002...
  9. [SOLVED] Ceph Slow Ops Reported

    Sorry about that: https://pastebin.com/Y5RT148v Don't know why it disappeared. That was my assumption, but at this point I'm willing to poke anything with a stick.
  10. [SOLVED] Ceph Slow Ops Reported

    I have a brand new, completely clean 6.2 installation that was completed on Friday. No VMs yet, 4 identical machines (hhcs000[1-4]). On two of the three monitors I am getting a lot of these in the logs, on hhcs0002 and hhcs0003, while hhcs0001 seems fine: mon.hhsc0002@1(peon) e5...
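
    The fix described across the [SOLVED] threads above (a CephFS mount left behind after the filesystem was destroyed, hanging `df` and causing slow ops) can be sketched roughly as follows. This is a hedged sketch, not the exact commands from the thread; the `/mnt/pve/cephfs` path is an assumption based on the default Proxmox CephFS storage mount point and may differ on your cluster:

    ```shell
    # List any CephFS mounts the kernel still knows about; a mount whose
    # backing filesystem was destroyed will keep clients retrying forever.
    findmnt -t ceph

    # Lazy-unmount the stale mount so hung I/O is detached immediately
    # (path assumed; substitute the mount point findmnt reported).
    umount -l /mnt/pve/cephfs

    # Confirm df no longer hangs and the slow-op warnings stop accumulating.
    df -h
    ceph health detail
    ```

    The lazy (`-l`) flag matters here: a plain `umount` can itself block on the unreachable filesystem, while a lazy unmount detaches the mount point right away and cleans up once references are released.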