Search results

  1. K

    What's the actual, official way to reboot nodes in an HA-enabled cluster?

    Was that time span far enough into the past, Thomas?
  2. K

    What's the actual, official way to reboot nodes in an HA-enabled cluster?

    I ran into the 15000 character limit, so I'm attaching two log excerpts from 16:00 to 17:32 for pve-ha-lrm and pve-ha-crm, respectively. Is that far enough into the past?
  3. K

    What's the actual, official way to reboot nodes in an HA-enabled cluster?

    Oh, and of course this might be good to know: 16:25 is when proxmox3 went online after its planned reboot. 17:32 is when it went online after its unexpected reboot.
  4. K

    What's the actual, official way to reboot nodes in an HA-enabled cluster?

    Same time span, different node:
    Oct 6 17:29:00 proxmox0 systemd[1]: Starting Proxmox VE replication runner...
    Oct 6 17:29:01 proxmox0 systemd[1]: pvesr.service: Succeeded.
    Oct 6 17:29:01 proxmox0 systemd[1]: Started Proxmox VE replication runner.
    Oct 6 17:29:58 proxmox0 corosync[6718]...
  5. K

    What's the actual, official way to reboot nodes in an HA-enabled cluster?

    If it's simply the journal on the unexpectedly rebooting node between the end of the planned reboot and the end of the unexpected reboot, this is it:
    Oct 6 17:26:52 proxmox3 pmxcfs[5842]: [status] notice: received log
    Oct 6 17:27:00 proxmox3 systemd[1]: Starting Proxmox VE replication...
  6. K

    What's the actual, official way to reboot nodes in an HA-enabled cluster?

    Absolutely. How do I generate the output you need? From a node which got upgraded:
    proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
    pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
    pve-kernel-5.4: 6.4-5
    pve-kernel-helper: 6.4-5
    pve-kernel-5.3: 6.1-6
    pve-kernel-5.4.128-1-pve: 5.4.128-2...
  7. K

    What's the actual, official way to reboot nodes in an HA-enabled cluster?

    I'm asking because this old answer appears to be incorrect. We have a cluster with nine nodes. HA is enabled, with three rings. When it's time to upgrade packages, we do the following:
    Remove a node from all HA groups
    Wait for it to be fully evacuated
    apt update && apt upgrade
    Reboot and wait...
  8. K

    HA migration selection criteria?

    Does active service count mean number of VPSes/containers or simply the number of active processes? Would you mind pointing me in the right direction to find where in the code this check happens? I'd like to study it and see if we can find a workaround in-house.
  9. K

    HA migration selection criteria?

    I have been experimenting a bit with how a cluster behaves in terms of migrations when a node is removed from an HA group. As far as I can tell, when the removed node is evacuated by HA, the target node seems to be selected based on which remaining node has the lowest CPU usage at the time of...
  10. K

    Recommended way to reboot a node in a cluster with HA enabled?

    The nodes have four SFP+ 10Gb links, arranged into two LACP bonds plugged into two switches, and they also have a Gigabit Ethernet link each which goes to a third switch. We have two corosync networks; ring0 goes over one of the 10Gb bonds and is not dedicated, but ring1 goes over the GbE link...
  11. K

    Recommended way to reboot a node in a cluster with HA enabled?

    Apr 22 23:32:25 node6 corosync[4057]: [MAIN ] Completed service synchronization, ready to provide service.
    Apr 23 09:51:38 node6 corosync[4057]: [TOTEM ] A new membership (1.176) was formed. Members left: 4
    Apr 23 09:51:38 node6 corosync[4057]: [TOTEM ] A new membership (1.176) was...
  12. K

    Recommended way to reboot a node in a cluster with HA enabled?

    Apr 22 23:32:25 node5 corosync[2241]: [MAIN ] Completed service synchronization, ready to provide service.
    Apr 23 09:50:59 node5 corosync[2241]: [TOTEM ] Retransmit List: dcf8c
    Apr 23 09:50:59 node5 corosync[2241]: [TOTEM ] Retransmit List: dcf8d
    Apr 23 09:50:59 node5 corosync[2241]...
  13. K

    Recommended way to reboot a node in a cluster with HA enabled?

    Apr 22 19:13:39 node3 corosync[4140]: [MAIN ] Completed service synchronization, ready to provide service.
    Apr 22 21:46:19 node3 corosync[4140]: [TOTEM ] Retransmit List: 377fa
    Apr 22 23:28:51 node3 corosync[4140]: [KNET ] link: host: 1 link: 0 is down
    Apr 22 23:28:51 node3...
  14. K

    Recommended way to reboot a node in a cluster with HA enabled?

    The cluster in question has seven nodes. Each node has one vote. They were all seemingly in quorum before one node was rebooted and the rest decided to take a dive. Here are the corosync logs for each of the nodes around the incident. That gap between 09:52 and 09:55 is when everything rebooted...
  15. K

    Recommended way to reboot a node in a cluster with HA enabled?

    Very occasionally when we reboot a node, the whole cluster reboots. Is there a recommended shutdown procedure we're missing? Is there a way to tell corosync “this node will go offline for a while now, so please don't panic”?
  16. K

    Clear HA migration queue?

    Thank you. To be honest, I sort of assumed that this was known and intentional for some reason. Yes, if this behaviour were changed it would change a lot. Again, thank you.
  17. K

    Clear HA migration queue?

    About 500 or so. The cluster contains seven nodes, but most of the VMs are kept on five. We can go down to four in an emergency without disruption of service, if things are kept balanced. Something I've spent quite a few code-hours trying to automate, and it mostly works fine, but there are a...
  18. K

    Clear HA migration queue?

    Is there a way to view this migration queue, then? Besides fetching a list of all the services with state migrate, I mean, which doesn't show any additional information such as destination. Where is the queue stored internally? Well … the last time it happened, we decided to do just that, and...
  19. K

    Clear HA migration queue?

    Let's say you have a cluster with HA enabled. If you have an “oh, crap” moment and accidentally trigger a mass migrate of a large number of virtual machines, placing them all in HA state migrate and queued for migration, is there a way to clear this queue so they don't get migrated? If there's...
  20. K

    Correct permissions for HA administration?

    Sorry if this constitutes necroposting, but I just wanted to thank you for listening to me and responding with a patch so quickly. :)
