Recent content by Nathan Stratton

  1. Proxmox with 48 nodes

    2690v4 is actually 22 cores, and you're neglecting the power of the fans, GPUs, hard drives, etc. You're right that you need less than half as many, but the power consumption is not that much different between the two workloads. Also, when you factor in the price of the used E5-2690v4 and the new...
  2. Proxmox with 48 nodes

    Yes, latency is the same for 40/10, but with dual ports and VLANs for traffic separation, I thought I would be ok. As for why, the hardware was reclaimed and that is what we have... You're right about Broadwell, but again, it's what we have, and I'm not so sure about the cost savings. We...
  3. Proxmox with 48 nodes

    I know that in the past the recommended maximum number of nodes in a cluster was 32, but is this still the case? My boxes are all dual E5-2690v4 with dual 40 Gig Ethernet. I would like to have one cluster with 48 nodes, but is that a bad idea? Should I go with two clusters of 24 nodes?
  4. bash sleep 10 is very inconsistent on VM

    I think you may be right; the .py script just tails the log, greps for a line, and shows the difference since the last one. Any idea why sending a line to syslog every 10 seconds would be an issue, but writing the same log line to a file every 10 seconds would not? #!/usr/bin/env python3 import argparse, os...
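
    A rough shell equivalent of that tail-and-diff idea, as a sketch (the log path and the 'sleep-test' marker are assumptions, not the poster's actual .py script):

        # Watch syslog for the marker line and print the seconds elapsed since
        # the previous occurrence; the intervals should hover around 10 s.
        prev=0
        tail -Fn0 /var/log/syslog | grep --line-buffered 'sleep-test' | while read -r line; do
            now=$(date +%s)
            [ "$prev" -ne 0 ] && echo "interval: $((now - prev))s"
            prev=$now
        done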
  5. bash sleep 10 is very inconsistent on VM

    I shared the script; it's very simple, just sleep 10 and writing to syslog. The host nodes are all AMD EPYC 7763 64-Core Processor, 512G RAM, Linux virt1 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2. The VMs range from 4 cores / 1 socket to 8 cores / 1 socket, 6.8.0-79-generic #79-Ubuntu...
  6. bash sleep 10 is very inconsistent on VM

    Running Proxmox 9.0.10 with an Ubuntu 24.04 guest. I have been working with voipmonitor support staff, who say my issues are because of my clock. I tested a simple script that SHOULD run every 10 seconds, but it does not. This is on the weekend, when my load is almost zero. Any ideas? #!/bin/bash while...
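
    A minimal sketch of the kind of 10-second test loop being described (the 'sleep-test' tag is an assumption):

        #!/bin/bash
        # Log a marker line every 10 seconds and let syslog timestamp it; the
        # gaps between consecutive timestamps should then stay close to 10 s.
        while true; do
            logger -t sleep-test "tick $(date +%s)"
            sleep 10
        done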
  7. How do I remove duplicate unknown monitor, manager, and MDS?

    I tried to delete and recreate the service, but the ? unknown service came back when I created a new one for virt2. Any ideas on how to clean this up?
  8. IOMMU 4 NVIDIA GPUs with NCCL

    I have a VM with four exported 3090 GPUs. The GPUs work and I can run things like gpuburn, but when I try to train my models with NCCL I run into errors. I don't have an ACS option in the BIOS of this Supermicro H12SSL (I believe it's off, hence no option), but I do have IOMMU on so I can export the cards to...
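
    A first diagnostic step inside the guest is to rerun with verbose NCCL logging and peer-to-peer disabled, which can show whether direct GPU-to-GPU transfers over the virtual PCIe topology are what is failing (train.py is a placeholder for whatever launches the training):

        # Verbose NCCL logs, with P2P disabled so transfers fall back to
        # staging through host memory instead of direct GPU-to-GPU DMA.
        NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 python train.py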
  9. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    A bit more info: the 4 GPUs are on two x16 slots that are bifurcated into two x8 slots, one per GPU. When I boot without pcie_acs_override=downstream, the first two cards are in Group 13 and the last two in Group 49, so that won't work with 4 VMs each using 1 card. With the...
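
    For context, the override is set on the host kernel command line, roughly like this (a hypothetical /etc/default/grub excerpt; the ACS override patch ships in the Proxmox kernel):

        # /etc/default/grub on the host; append the override to the existing options
        GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_acs_override=downstream"
        # regenerate the boot config and reboot for it to take effect
        update-grub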
  10. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    Sorry, you're right, I was not clear. It is still in the same state as the original post: I can start one or the other, but not both, if they are not in the same VM. I am using pcie_acs_override and see them each in a different group. /sys/kernel/iommu_groups/48/devices/0000:81:00.0...
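
    A common way to dump the full grouping on the host, for reference:

        # List every IOMMU group and the devices it contains
        for g in /sys/kernel/iommu_groups/*; do
            echo "Group ${g##*/}:"
            for d in "$g"/devices/*; do
                lspci -nns "${d##*/}"
            done
        done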
  11. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    Thanks, they are now in different groups (with pcie_acs_override), but without it they are not. Yes, plenty of RAM; it's something with passthrough.
  12. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    System setup: Proxmox 8.0.4, Supermicro H12SSL, 1 Nvidia 4090, 3 Nvidia 3080, machine type q35. virt101 - 3080, PCI Device 0000:02:00; virt103 - 4090, PCI Device 0000:01:00. I had virt105 with two 3080s, PCI Device 0000:81:00 and 0000:82:00. Everything works great with this setup; I shut down 105, cloned...
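
    The two-3080 VM described here maps to hostpci entries along these lines (assuming virt105 is VMID 105; a sketch, not the poster's exact config):

        # Both 3080s attached to the one q35 VM
        qm set 105 --machine q35 --hostpci0 0000:81:00,pcie=1 --hostpci1 0000:82:00,pcie=1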
  13. Palo Alto Networks VM

    Figured it out: you need to add a serial port. :)
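
    On Proxmox that amounts to something like the following (the VMID is a placeholder):

        # Add a serial port to the VM, then attach to it for the console
        qm set <vmid> --serial0 socket
        qm terminal <vmid>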
  14. Palo Alto Networks VM

    Ever get past this? I am seeing the same issue.
  15. CEPH monitor cannot be deleted when the node fails and goes offline!

    root@virt01:/var/lib/ceph# pveceph createmon --monid virt01 --mon-address 10.0.0.101
    monitor 'virt01' already exists
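
    A hedged sketch of the usual cleanup before re-creating the monitor on a rebuilt node (assumes the surviving monitors still have quorum):

        # Drop the stale monitor from the monmap and clear any leftover local data,
        # then retry the create from above.
        ceph mon remove virt01
        rm -rf /var/lib/ceph/mon/ceph-virt01
        # Also remove the [mon.virt01] section from /etc/pve/ceph.conf if one is present.
        pveceph createmon --monid virt01 --mon-address 10.0.0.101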