Recent content by Nathan Stratton

  1. IOMMU 4 NVIDIA GPUs with NCCL

    I have a VM with four exported 3090 GPUs. The GPUs work and I can run things like gpuburn, but when I try to train my models with NCCL I run into errors. I don't have an ACS option in the BIOS of the Supermicro H12SSL (I believe it's off, so there is no option to change), but I do have IOMMU on so I can export the cards to...
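    In case it helps anyone hitting the same thing: a quick way to see how the four cards (and their companion functions) ended up grouped on the host is to dump every PCI device with its IOMMU group. A rough sketch, run on the Proxmox host:

      for dev in /sys/kernel/iommu_groups/*/devices/*; do
          group=$(basename "$(dirname "$(dirname "$dev")")")
          echo "IOMMU group $group: $(lspci -nns "${dev##*/}")"
      done
      # Inside the VM, NCCL_P2P_DISABLE=1 is a commonly suggested workaround when P2P fails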
  2. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    A bit more info: the 4 GPUs are on two x16 slots that are bifurcated into two x8 slots, one per GPU. When I boot without pcie_acs_override=downstream, the first two cards are in Group 13 and the last two in Group 49, so that won't work with 4 VMs each using 1 card. With the...
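    For reference, this is roughly where that parameter goes on a GRUB-booted Proxmox host (systemd-boot installs edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead); the flags shown are just what this thread describes, not a recommendation:

      # /etc/default/grub
      GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_acs_override=downstream"
      # some setups use the stronger variant: pcie_acs_override=downstream,multifunction
      update-grub   # then reboot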
  3. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    Sorry, you're right, I was not clear. It is still in the same state as the original post: I can start one VM or the other, but not both, unless the GPUs are in the same VM. I am using pcie_acs_override and see them each in a different group. /sys/kernel/iommu_groups/48/devices/0000:81:00.0...
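    A quick way to double-check that from the shell is to resolve each card's iommu_group symlink (device addresses here are the ones from this thread):

      readlink -f /sys/bus/pci/devices/0000:81:00.0/iommu_group
      readlink -f /sys/bus/pci/devices/0000:82:00.0/iommu_group
      # different group numbers in the output mean the cards can go to different VMs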
  4. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    Thanks, they are now in different groups (with pcie_acs_override), but without it they are not. Yes, plenty of RAM; it's something with passthrough.
  5. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    System setup: Proxmox 8.0.4, Supermicro H12SSL, 1 Nvidia 4090, 3 Nvidia 3080s, machine type q35. virt101 - 3080, PCI device 0000:02:00; virt103 - 4090, PCI device 0000:01:00. I had virt105 with two 3080s, PCI devices 0000:81:00 and 0000:82:00. Everything works great with this setup; I shut down 105, cloned...
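    For anyone reproducing this, the passthrough ends up as hostpci entries in each VM's config file; a hypothetical excerpt (illustrative only, not the poster's actual config) would look something like:

      # /etc/pve/qemu-server/105.conf
      machine: q35
      hostpci0: 0000:81:00,pcie=1,x-vga=1
      hostpci1: 0000:82:00,pcie=1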
  6. Palo Alto Networks VM

    Figured it out, you need to add a serial port. :)
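    For reference, adding the serial console to the guest is a one-liner (VMID 200 is just a placeholder):

      qm set 200 --serial0 socket
      # then attach to it with: qm terminal 200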
  7. Palo Alto Networks VM

    Ever get past this? I am seeing the same issue.
  8. CEPH monitor cannot be deleted when the node fails and goes offline!

    root@virt01:/var/lib/ceph# pveceph createmon --monid virt01 --mon-address 10.0.0.101
    monitor 'virt01' already exists
  9. CEPH monitor cannot be deleted when the node fails and goes offline!

    I am having a similar problem: Proxmox still sees a monitor, but it has already been removed from Ceph:
    root@virt01:/var/lib/ceph# pveceph mon destroy virt01
    no such monitor id 'virt01'
    root@virt01:/var/lib/ceph# ceph mon remove virt01
    mon.virt01 does not exist or has already been removed...
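    What usually clears this kind of mismatch, as far as I understand it, is removing the leftover Proxmox-side references by hand. A cautious sketch only (back up the file first, and only touch data for a monitor that really is gone from Ceph):

      # remove the [mon.virt01] section and its mon_host entry from the shared config
      nano /etc/pve/ceph.conf
      # stop and disable the orphaned monitor unit, then drop its data directory
      systemctl disable --now ceph-mon@virt01
      rm -rf /var/lib/ceph/mon/ceph-virt01
      ceph mon stat   # verify the remaining monitors are in quorum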
  10. Dual Nvidia 3080 GPUs work on same VM, but not if I have 2 VMs with 1 3080 GPU on each.

    Yep, they are both in group 49. I am going to try a few BIOS settings; if that does not work, I found https://gitlab.com/Queuecumber/linux-acs-override, but I would rather see if I can do this without a custom kernel.
  11. Dual Nvidia 3080 GPUs work on same VM, but not if I have 2 VMs with 1 3080 GPU on each.

    I have 2 Nvidia 3080s, on PCI 0000:01:00.0 and 0000:02:00.0. If I put them both on one VM, with x-vga=on and multifunction=on, it works and nvidia-smi shows two GPUs. However, if I start vm1 with one of the GPUs, say 0000:01:00.0, it will start fine; if I then try to start vm2 with GPU 0000:02:00.0...
  12. Easy way to enable ceph authentication on a 21 server cluster without auth?

    I currently have a 21-server Ceph cluster with 105 OSDs and I need to enable Ceph authentication because Kubernetes can't mount a Ceph volume without auth! I have looked at https://docs.ceph.com/en/latest/rados/configuration/auth-config-ref/ Is that the only way with Proxmox, or is there an...
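    For what it's worth, the setting itself is just three lines in the [global] section of the cluster-wide /etc/pve/ceph.conf; the operational part (generating keyrings and doing a rolling restart of mons and OSDs) is what the linked Ceph doc covers, so treat this as a sketch only:

      # /etc/pve/ceph.conf, [global] section (was 'none' on a cluster built without auth)
      auth_cluster_required = cephx
      auth_service_required = cephx
      auth_client_required = cephx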
  13. Problem configuring network for guest with tagged and untagged VLAN

    I also believe this is correct; I have not been able to create tagged and untagged in the same bridge.
  14. Getting rid of phantom node

    I checked systemctl list-units on all hosts and don't see anything that does not belong. You mentioned a directory; what should I look for there? Where does that GUI panel pull its information from?
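    For context, as far as I know the server view in the GUI is built from the per-node directories in the cluster filesystem, so that is the directory worth checking:

      ls /etc/pve/nodes/      # the GUI tree shows one entry per directory here
      pvecm nodes             # what corosync thinks the membership is
      # if the phantom is already gone from corosync, deleting its leftover
      # directory under /etc/pve/nodes/ is what removes it from the GUI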
  15. Ceph performance dropped after upgrade from 14.2.15 to 14.2.20

    Not sure if it is because I am in the middle of rebalancing, but when I restart OSDs I get a bunch of "active+undersized+degraded+remapped+backfill_wait" PGs, so I stopped restarting them.
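    A note for anyone in the same spot: the usual way to restart OSDs during a rebalance without triggering extra backfill is to set the noout (and optionally norebalance) flags around the restarts, for example:

      ceph osd set noout               # don't mark restarted OSDs out
      ceph osd set norebalance         # optionally pause rebalancing while you work
      systemctl restart ceph-osd@12    # '12' is a placeholder OSD id
      ceph osd unset norebalance
      ceph osd unset noout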
