Recent content by surfrock66

  1. Help repairing an EFI issue on a btrfs root mirror?

    One of my Proxmox 8.3 hosts got updates, rebooted, and never woke up (2 other servers, one of which is identical hardware, came up fine). The Dell BIOS on the R720 shows this: This system has its root as a btrfs mirror on 2 drives, and should be EFI. I think I need to boot the installation...
  2. Are reboots logged before, or after, a reboot?

    Frustrating, as the iDRAC logs expose no smoking gun; I actually got a new backplane thinking it was the issue, and it didn't reduce that message. There's a loose correlation with the drive reset message, but that doesn't seem like something that would cause an unexpected reboot. Nothing is exposed...
  3. Are reboots logged before, or after, a reboot?

    I'm troubleshooting random host reboots which have been causing me huge headaches. I see them in the logs, and nothing appears to precede them. Mar 17 17:06:28 sr66-prox-03 pveproxy[348253]: Clearing outdated entries from certificate cache Mar 17 17:07:10 sr66-prox-03 snmpd[4026]...
  4. VM's occasionally hang, is the following cronjob to reset vm's a good idea?

    No Ceph in the environment; these are actually iSCSI LUNs from a TrueNAS SCALE box. They're pretty rock solid, and I never had an issue when I was dealing with a cluster of 3 hypervisors with identical CPUs. The other thing that's weird is it's not all at the same time; basically over the next 24...
  5. VM's occasionally hang, is the following cronjob to reset vm's a good idea?

    1) Sample for a VM on the host in question (my Collabora server) root@sr66-prox-03:~# qm config 116 agent: 1,fstrim_cloned_disks=1 boot: order=ide2;scsi0 cores: 4 cpu: x86-64-v2-AES hotplug: disk,network,usb,memory,cpu ide2: none,media=cdrom memory: 4096 meta...
  6. VM's occasionally hang, is the following cronjob to reset vm's a good idea?

    I'd been down this troubleshooting path before, but the answer was "expect unexpected stability on clusters with vastly different hardware." 1) 2 are Dell PowerEdge R720s with 2x Xeon X5687s, one is a Dell PowerEdge R6525 with an EPYC 7252. As budget allows this is to be the standard for replacing...
  7. VM's occasionally hang, is the following cronjob to reset vm's a good idea?

    I am having an issue with a cluster of 3 hypervisors, one of which is substantially different from the others (totally different CPU generation). Eventually I will replace them all, but that takes time. Until then, I have the CPU type set to the lowest compatible version, and for the most part...
  8. Proxmox Datacenter Manager - First Alpha Release

    I like the role of this, but it still seems like most functions exist on the hosts, not at the manager, and I'm wondering if the vision is to move things to PDCM as things get more robust. Cluster-wide load balancing is still something I relegate to Prox-LB, and I'd love for that to be native...
  9. [Solved] Need help recovering disk of a VM

    Here's what I ended up doing: 1) Took a snapshot of the VM disk, @now. 2) Sent the disk to the other host with zfs send. 3) Did a rescan of resources. So, on Host 1: zfs snapshot prox-zpool-01/vm-101-disk-1@now; zfs send -Rpv prox-zpool-01/vm-101-disk-1@now | ssh -o BatchMode=yes root@10.2.10.30 zfs recv -Fv...
  10. [Solved] Need help recovering disk of a VM

    I have a VM in a hardlocked condition. This VM had 2 disks on NAS storage, and was set to migrate in HA. As a short term fix, I moved one of those 2 disks to the local ZFS storage on the node, but I did NOT configure a replication job (oversight). A power event caused the node to drop, and...
  11. Tips for diagnosing the cause of a host reboot?

    I want to relay some further discovery here, as I think I have uncovered the root cause; however, I still think there is an issue. We were having reboots every 3 days or less, but after discovering and resolving the issue, I have had over 8 days of uptime. This system has two 2.5G NICs...
  12. Tips for diagnosing the cause of a host reboot?

    I had another unexpected reboot overnight, and at this time I have to remove this node from my environment. That being said, I don't think I have a hardware issue; I believe this is an OS-initiated reboot, and I don't know where it comes from. I thought it could be something from the iDRAC but the logs...
  13. Tips for diagnosing the cause of a host reboot?

    I had another unexpected reboot today, but I don't see the cause. Fencing dealt with my guests. Below is the start of the irregular log entries; iDRAC logged no faults with storage or anything else. Nov 27 19:33:50 sr66-prox-03 kernel: sd 0:0:19:0: [sda] tag#974 BRCM Debug mfi stat 0x2d, data len...
  14. Tips for diagnosing the cause of a host reboot?

    I took the opportunity to upgrade to 8.3.0 as well, in case I'm hitting a bug.
  15. Tips for diagnosing the cause of a host reboot?

    I got the notification from another node that it was trying to fence at 7:57, which, per the timestamps, was likely after the failure.