Recent content by dlasher

  1. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Just to make sure I understand this correctly: if I remove all the HA-configured LXC/KVM settings (I have DNS servers, video recorders, etc.) and make them stand-alone, no-failover configs, it won't fence if Corosync gets unhappy? (That doesn't seem to ring true to me in a shared-storage world - see the ha-manager sketch after this list.)
  2. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Thanks, gave it some thought, and changed the priorities a bit (see the knet_link_priority sketch after this list) - we'll see if it does better than it has in the past. (It also has me thinking about things that could lower the latency between nodes, like MTU on the ring interfaces.) It would be nice to gather raw data on keepalives across all...
  3. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Yeah, feels like only ceph replication has saved me from the heavy hand of rebooting. :( What would you suggest?
  4. Proxmox cluster reboots on network loss?

    Sorry to necro this thread, but it's one of *many* that come up with this title, and it goes directly to the core issue: Proxmox needs a configurable option for fencing behavior. Rebooting an entire cluster upon the loss of a networking element is the sledgehammer; we need the scalpel.
  5. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Thank you, fantastic information, already used it to clean things up a bit. Not if PMX thinks we need to reboot. So far, none of the failures have taken down CEPH; it's pmx/HA that gets offended. (Ironic, because corosync/totem has (4) rings, and CEPH sits on a single vlan, but I digress.) The...
  6. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    @fabian -- any thoughts on this question? I'd love to have more control over the failure steps/scenarios.
  7. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    That makes sense, thank you, I didn't understand the corosync/totem/cluster-manager interop. (Is this written up anywhere I can digest?) I'll drop the timeouts back to default values (see the totem-timeout sketch after this list). Since I know how to cause the meltdown, it will be easy to test the results of the change. How would you...
  8. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    RE: pmx2 - good catch - no, that wasn't intentional; fixing it already. From a network standpoint, 198.18.50-53.xxx can all ping each other, so yes, the network pieces were all operational. Based on the config, however, it looks like pmx2 wasn't on ring2 correctly. That in and of itself shouldn't...
  9. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Here's what the same event looked like from pmx4 (node 3):
    Oct 03 23:17:58 pmx4 corosync[6951]: [TOTEM ] Token has not been received in 4687 ms
    Oct 03 23:17:58 pmx4 corosync[6951]: [KNET ] link: host: 6 link: 0 is down
    Oct 03 23:17:58 pmx4 corosync[6951]: [KNET ] link: host: 6 link: 1 is...
  10. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    For reference, from a topology standpoint, pmx1/2/3/4/5 (nodes 6,5,4,3,2) sit in the same rack, whereas pmx6/7 (nodes 1,7) sit in another room, connected to different switches with shared infra between.
    root@pmx1:~# pveversion
    pve-manager/7.2-11/b76d3178 (running kernel: 5.15.39-3-pve)
  11. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    And the conf file:
    logging {
      debug: off
      to_syslog: yes
    }
    nodelist {
      node {
        name: pmx1
        nodeid: 6
        quorum_votes: 1
        ring0_addr: 10.4.5.101
        ring1_addr: 198.18.50.101
        ring2_addr: 198.18.51.101
        ring3_addr: 198.18.53.101
      }
      node {
        name: pmx2
        nodeid: 5...
  12. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Here are logs from a (7) node cluster; this is from node 1. I notice that there's nothing in the logs that explicitly says "hey, we've failed, I'm rebooting", so I hope this makes sense to you @fabian . I read this as "lost 0, lost 1, 2 is fine, we shuffle a bit to make 2 happy, then pull the...
  13. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Will dig them out tonight, thanks. From an operational standpoint, is there any way to tweak the behavior of fencing? For example, this cluster has CEPH, and as long as CEPH is happy, I'm fine with all the VMs being shut down, but by all means, don't ()*@#$ reboot!!! It's easily 20 minutes...
  14. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Someone please explain to me why the loss of a single ring should force the entire cluster (9 hosts) to reboot? Topology - isn't 4 rings enough??
    ring0_addr: 10.4.5.0/24    -- eth0/bond0 - switch1 (1ge)
    ring1_addr: 198.18.50.0/24 -- eth1/bond1 - switch2 (1ge)
    ring2_addr...
  15. [SOLVED] Making zfs root mirror bootable (uefi)

    This solved my problem as well - thank you. Somewhere there should be a short "Admin HowTo" list, because this would be part of a document labeled "How to replace a ZFS boot disk in a mirror set for Proxmox" (a command sketch for that follows this list). (To be fair - this: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html - is a...
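
For the HA question in item 1: in Proxmox VE, the watchdog-based self-fencing should only trigger on nodes whose LRM has (or recently had) active HA resources, so taking the guests out of HA management is the usual way to stop the reboots even when corosync loses quorum; it does not touch the shared storage itself. A minimal sketch, assuming hypothetical resource IDs vm:100 and ct:101:

    # list what the HA stack currently manages
    ha-manager status
    ha-manager config

    # take the guests out of HA management (they keep running, just no automatic failover)
    ha-manager remove vm:100
    ha-manager remove ct:101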
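
For the priority change in item 2, a minimal corosync.conf sketch, assuming corosync 3 with kronosnet in the default passive link mode (only one link carries traffic at a time, and the link with the highest knet_link_priority that is up wins). Only the interface blocks are shown; the existing totem settings stay as they are, and config_version must be bumped when editing /etc/pve/corosync.conf. The priority values are illustrative, not taken from this cluster:

    totem {
      interface {
        linknumber: 0
        knet_link_priority: 40
      }
      interface {
        linknumber: 1
        knet_link_priority: 30
      }
      interface {
        linknumber: 2
        knet_link_priority: 20
      }
      interface {
        linknumber: 3
        knet_link_priority: 10
      }
    }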
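
For "drop the timeouts back to default values" in item 7, going back to defaults mostly means removing any token or token_retransmits_before_loss_const overrides from the totem section. A small sketch of how to confirm what corosync is actually running with afterwards, using standard corosync tools (nothing cluster-specific assumed):

    # effective totem timing as corosync computed it at runtime
    corosync-cmapctl | grep -i runtime.config.totem.token

    # per-link status on this node (which rings are up)
    corosync-cfgtool -s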
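
For the HowTo wished for in item 15, a condensed sketch of the UEFI replacement steps along the lines of the linked sysadmin chapter; /dev/sdY is a hypothetical healthy mirror member, /dev/sdX the new disk, and the partition numbers assume the stock Proxmox layout (ESP on partition 2, ZFS on partition 3). Stable /dev/disk/by-id/ paths are preferable to sdX names in practice:

    # copy the partition table from a healthy disk, then randomize the new disk's GUIDs
    sgdisk /dev/sdY -R /dev/sdX
    sgdisk -G /dev/sdX

    # resilver the ZFS partition into the mirror (find the failed vdev's name in zpool status)
    zpool status rpool
    zpool replace -f rpool <failed-vdev> /dev/sdX3

    # make the new disk bootable on UEFI systems
    proxmox-boot-tool format /dev/sdX2
    proxmox-boot-tool init /dev/sdX2
    proxmox-boot-tool status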
