Search results

  1. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    RE: pmx2 - Good catch - no, that wasn't intentional; fixing it already. From a network standpoint, 198.18.50-53.xxx can all ping each other, so the network pieces, yes, were all operational. Based on the config, however, it looks like pmx2 wasn't on ring2 correctly. That in and of itself shouldn't...
  2. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Here's what the same event looked like from pmx4 (node 3) Oct 03 23:17:58 pmx4 corosync[6951]: [TOTEM ] Token has not been received in 4687 ms Oct 03 23:17:58 pmx4 corosync[6951]: [KNET ] link: host: 6 link: 0 is down Oct 03 23:17:58 pmx4 corosync[6951]: [KNET ] link: host: 6 link: 1 is...
  3. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    For reference, from a topology standpoint, pmx1/2/3/4/5 (nodes 6,5,4,3,2) sit in the same rack, whereas pmx6/7 (nodes 1,7) sit in another room, connected to different switches with shared infra between. root@pmx1:~# pveversion pve-manager/7.2-11/b76d3178 (running kernel: 5.15.39-3-pve)
  4. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    And the conf file. logging { debug: off to_syslog: yes } nodelist { node { name: pmx1 nodeid: 6 quorum_votes: 1 ring0_addr: 10.4.5.101 ring1_addr: 198.18.50.101 ring2_addr: 198.18.51.101 ring3_addr: 198.18.53.101 } node { name: pmx2 nodeid: 5...
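
    A quick way to confirm that each ring in a layout like the one above is actually passing traffic is corosync's own status tooling; a minimal check, run on any node (output details vary with the corosync version):

      # show the local node id and the state of every knet link/ring
      corosync-cfgtool -s
      # show membership and quorum as corosync currently sees it
      corosync-quorumtool -s
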
  5. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Here are logs from a (7) node cluster; this is from node 1 - I notice that there's nothing in the logs that explicitly says "hey, we've failed, I'm rebooting", so I hope this makes sense to you @fabian . I read this as "lost 0, lost 1, 2 is fine, we shuffle a bit to make 2 happy, then pull the...
  6. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Will dig them out tonight, thanks. From an operational standpoint, is there any way to tweak the behavior of fencing? For example, this cluster has CEPH, and as long as CEPH is happy, I'm fine with all the VMs being shut down, but by all means, don't ()*@#$ reboot!!! It's easily 20 minutes...
  7. Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

    Someone please explain to me why the loss of a single ring should force the entire cluster (9 hosts) to reboot? Topology - isn't 4 rings enough?? ring0_addr: 10.4.5.0/24 -- eth0/bond0 - switch1 (1ge) ring1_addr: 198.18.50.0/24 -- eth1/bond1 - switch2 (1ge) ring2_addr...
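
    One knob that often comes up in these fencing threads is corosync's per-link priority for knet; a hedged sketch of what that looks like in the totem section is below (the link numbers and priorities are illustrative assumptions, not the poster's actual config):

      totem {
        # with the default knet "passive" link mode, the highest-priority live link carries the traffic
        interface {
          linknumber: 0
          knet_link_priority: 10
        }
        interface {
          linknumber: 1
          knet_link_priority: 5
        }
      }
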
  8. [SOLVED] Making zfs root mirror bootable (uefi)

    This solved my problem as well - thank you. Somewhere there should be a short "Admin HowTo" list, because this would be part of a document labeled: "How to replace a ZFS boot disk in a mirror set for Proxmox" (To be fair - this : https://pve.proxmox.com/pve-docs/chapter-sysadmin.html - is a...
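
    The sysadmin chapter linked above does describe the replacement flow; for a failed member of a UEFI ZFS boot mirror it boils down to roughly the following sketch (device names are placeholders - verify partition numbers against your own layout before running anything):

      # copy the partition table from the healthy mirror member to the new disk
      sgdisk /dev/disk/by-id/HEALTHY -R /dev/disk/by-id/NEW
      sgdisk -G /dev/disk/by-id/NEW              # randomize GUIDs on the copy
      # resilver the ZFS data partition
      zpool replace -f rpool /dev/disk/by-id/OLD-part3 /dev/disk/by-id/NEW-part3
      # make the new ESP bootable
      proxmox-boot-tool format /dev/disk/by-id/NEW-part2
      proxmox-boot-tool init /dev/disk/by-id/NEW-part2
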
  9. Zabbix template

    How are people monitoring CEPH on their proxmox clusters? The default "Ceph by Zabbix Agent 2" quickly goes "unsupported" when added to a PMX host.
  10. [SOLVED] Ceph - Schedule deep scrubs to prevent service degradation

    Fantastic work, thanks for sharing... favorited this one. (honestly, this should be a set of options in the PMX Ceph admin page.)
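
    For context, this is distinct from (but related to) the stock Ceph options that govern when scrubs may run; a hedged example of confining them to a quiet window, with illustrative values not taken from the linked thread:

      # only start (deep) scrubs between 01:00 and 06:00
      ceph config set osd osd_scrub_begin_hour 1
      ceph config set osd osd_scrub_end_hour 6
      # keep concurrent scrubs per OSD to a minimum
      ceph config set osd osd_max_scrubs 1
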
  11. [SOLVED] Some LXC CT not starting after 7.0 update

    Happy to help! Glad it worked for you too! :)
  12. PMX7.0 - HA - preventing entire cluster reboot

    Having read all the other threads (including : https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/page-11#post-269235) , wanted to add - I'm running (4) different "rings" for corosync, spread across (4) different physical interfaces, and (2) different switches...
  13. PMX7.0 - HA - preventing entire cluster reboot

    pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve) - (5) node cluster, full HA setup, CEPH filesystem How do I prevent HA from rebooting the entire cluster? 20:05:39 up 22 min, 2 users, load average: 6.58, 6.91, 5.18 20:05:39 up 22 min, 1 user, load average: 4.34, 6.79, 6.23...
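
    For anyone landing here with the same question: the watchdog-based self-fencing is only armed on nodes that are actively running HA-managed resources, so one blunt way to avoid whole-cluster reboots while debugging is to take guests out of HA management entirely (the resource id below is a placeholder):

      # list resources currently under HA management
      ha-manager status
      # remove a guest from HA so the local resource manager can go idle and disarm the watchdog
      ha-manager remove vm:100
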
  14. CEPH multiple MDS on the same node

    Fair question, I'd like to see an answer as well - given pretty solid test data showing significant advantages to multiple MDSs (like this: https://croit.io/blog/ceph-performance-test-and-optimization) I'd love to see support for more than one MDS per server.
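
    Worth noting that the croit numbers are mostly about multiple *active* MDS ranks; once extra daemons exist somewhere in the cluster, enabling a second active rank is a one-liner (the filesystem name "cephfs" is an assumption):

      # allow two active MDS ranks for the filesystem named "cephfs"
      ceph fs set cephfs max_mds 2
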
  15. Best way to access CephFS from within VM (high perf)

    I am seeing an issue, however, with CephFS performance in VMs when one of the "mounted" IPs is down, for example: 198.18.53.101,198.18.53.102,198.18.53.103,198.18.53.104,198.18.53.105:/ /mnt/pve/cephfs when .103 was offline for a while today (crashed) VMs using things mounted in that path...
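
    For reference, the kernel CephFS mount shown above hands the client a list of monitor addresses to fail over between; a minimal /etc/fstab line of the same shape looks like this (addresses, client name and secret path are placeholders):

      198.18.53.101,198.18.53.102,198.18.53.103:/  /mnt/cephfs  ceph  name=guest,secretfile=/etc/ceph/guest.secret,_netdev  0  0
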
  16. [SOLVED] pveceph osd destroy is not cleaning device properly

    Just wanted to be a big +1 for this command - I've been doing it the hard way any time a drive fails, and was pleasantly surprised to find it cleaned up all the pv/vg's correctly, even in the case of DB on NVME. 10/10 will use again. :)
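
    For anyone searching for the exact invocation being praised here, the cleanup behaviour is a flag on the same command; a sketch, assuming the OSD (id is a placeholder) has already been stopped and marked out:

      # destroy the OSD and wipe the backing LVM volumes/partitions, including a separate DB device
      pveceph osd destroy 12 --cleanup
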
  17. Ceph 16.2.6 - CEPHFS failed after upgrade from 16.2.5

    TL;DR - Upgrade from 16.2.5 to 16.2.6 - CEPHFS fails to start after upgrade, all MDS in "standby" - requires ceph fs compat <fs name> add_incompat 7 "mds uses inline data" to work again. Longer version : pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve) apt dist-upgraded, CEPH...
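
    Spelled out with a concrete filesystem name (assumed to be "cephfs" here), the fix from the TL;DR is:

      # allow the standby MDS daemons to join the existing filesystem again after the 16.2.6 upgrade
      ceph fs compat cephfs add_incompat 7 "mds uses inline data"
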
  18. [SOLVED] Some LXC CT not starting after 7.0 update

    Ran into this exact issue this week, upgrading some older ubuntu-14-LTS containers, didn't realize rolling to 16 would kill them :( What I did to fix it: lxc mount $CTID chroot /var/lib/lxc/$CTID/rootfs apt update apt dist-upgrade do-release-upgrade ((( none found - had to do it by hand ))...
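
    Laid out step by step, and using the PVE pct wrapper with a placeholder container id rather than the raw lxc call from the quote, the fix above amounts to something like:

      CTID=101                                   # placeholder container id
      pct mount $CTID                            # exposes the CT rootfs on the host
      chroot /var/lib/lxc/$CTID/rootfs /bin/bash
      apt update && apt dist-upgrade
      do-release-upgrade                         # if no release is found, adjust sources.list by hand
      exit                                       # leave the chroot
      pct unmount $CTID
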
  19. New Mobile App for Proxmox VE!

    Sweet - works great on a 7.x cluster.
  20. [SOLVED] proxmox 7 / linux 5.11.22 issue with LSI 2008 controllers?

    As a datapoint: I just completed a new 5-node build, with several sets of 92xx cards * AOC-USAS-L8i (Broadcom 1068E) * LSI 9207-8i (IBM M5110) * AOC-S2308L-L8E (LSI 9207-8i) * two other random LSI 92xx cards I tried the cards as they were, and cross-flashed them to v20 (as appropriate), and...