I have a VM with four exported 3090 GPUs. The GPUs work and I can run things like gpuburn, but when I try to train my models with NCCL I run into errors. I don't have an ACS option in the BIOS (I believe it's off now, so there is no option) on my Supermicro H12SSL, but I do have IOMMU on so I can export the cards to...
So a bit more info: the 4 GPUs are on two x16 slots that are bifurcated into two x8 slots, one for each GPU. When I boot without pcie_acs_override=downstream, the first two cards are in Group 13 and the last two are in Group 49, so that won't work with 4 VMs each using 1 card. With the...
Sorry, you're right, I was not clear. It is still in the same state as the original post: I can start one or the other, but not both if they are not in the same VM. I am using pcie_acs_override, and I see them each in a different group.
/sys/kernel/iommu_groups/48/devices/0000:81:00.0...
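For anyone who wants to check the grouping without reading the sysfs tree by eye, here is a rough Python sketch that walks the IOMMU group directory and reports which PCI addresses share a group. The function names and the `root` parameter are made up for illustration, and it assumes the usual `<root>/<group>/devices/<pci-address>` layout shown in the path above.

```python
from collections import defaultdict
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = defaultdict(list)
    # Assumed layout: <root>/<group>/devices/<pci-address>
    for dev in Path(root).glob("*/devices/*"):
        group = dev.parent.parent.name
        if group.isdigit():
            groups[int(group)].append(dev.name)
    return {g: sorted(devs) for g, devs in sorted(groups.items())}


def shared_group(groups, addr_a, addr_b):
    """Return True if both PCI addresses sit in the same IOMMU group."""
    for devs in groups.values():
        if addr_a in devs and addr_b in devs:
            return True
    return False
```

Calling `iommu_groups()` with no arguments reads the live host. Two GPUs for which `shared_group(...)` returns True sit in the same IOMMU group and would normally have to go to the same VM.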
System Setup
Proxmox 8.0.4
Supermicro H12SSL
1 Nvidia 4090
3 Nvidia 3080
Machine q35
virt101 - 3080 PCI Device 0000:02:00
virt103 - 4090 PCI Device 0000:01:00
I had virt105 with two 3080s, PCI Device 0000:81:00 and 0000:82:00
Everything works great with this setup; I shut down 105, cloned...
I am having a similar problem: Proxmox still sees a monitor, but it has already been removed from Ceph:
root@virt01:/var/lib/ceph# pveceph mon destroy virt01
no such monitor id 'virt01'
root@virt01:/var/lib/ceph# ceph mon remove virt01
mon.virt01 does not exist or has already been removed...
Yep, they are both on 49. I am going to try a few BIOS settings; if that does not work, I found:
https://gitlab.com/Queuecumber/linux-acs-override
But I would rather see if I can do this without a custom kernel.
I have 2 Nvidia 3080s, on PCI 0000:01:00.0 and 0000:02:00.0. If I put them both on one VM, with x-vga=on and multifunction=on, it works and nvidia-smi shows two GPUs. However, if I start vm1 with one of the GPUs, say 0000:01:00.0, it will start fine; if I then try to start vm2 with GPU 0000:02:00.0...
I currently have a 21-server Ceph cluster with 105 OSDs, and I need to enable Ceph authentication because Kubernetes can't mount a Ceph volume without auth. I have looked at: https://docs.ceph.com/en/latest/rados/configuration/auth-config-ref/
Is that the only way with Proxmox, or is there an...
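Before changing anything, it can help to see which auth options a ceph.conf actually sets. Below is a small Python sketch that reads the `[global]` section from a string and reports the three `auth_*_required` options described on the linked docs page. The helper names are hypothetical and this is only a read-only check under the assumption of a simple INI-style file, not a procedure for enabling cephx on a live cluster.

```python
import configparser

AUTH_KEYS = (
    "auth_cluster_required",
    "auth_service_required",
    "auth_client_required",
)


def auth_settings(conf_text):
    """Return the auth_*_required values found in [global] (None if not set)."""
    # Strip leading whitespace so indented keys are not read as continuation lines.
    cleaned = "\n".join(line.strip() for line in conf_text.splitlines())
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(cleaned)
    found = {}
    if parser.has_section("global"):
        for key, value in parser.items("global"):
            # Option names may use spaces, dashes or underscores; normalise them.
            name = key.strip().replace(" ", "_").replace("-", "_")
            found[name] = value.strip()
    return {key: found.get(key) for key in AUTH_KEYS}


def explicitly_not_cephx(settings):
    """List the options that are set in the file to something other than cephx."""
    return [k for k, v in settings.items() if v is not None and v != "cephx"]
```

For example, `explicitly_not_cephx(auth_settings(open("/etc/ceph/ceph.conf").read()))` would list the options still set to `none`; the path here is only the conventional location and may differ on your hosts.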
I checked systemctl list-units on all hosts and don't see anything that does not belong. You mentioned a directory; what should I look for there? Where does that GUI panel pull its information from?
Not sure if it is because I am in the middle of rebalancing, but when I restart OSDs I get a bunch of "active+undersized+degraded+remapped+backfill_wait" PGs, so I stopped restarting them.
I upgraded ceph on my 21 node cluster from 14.2.15 to 14.2.20 and restarted all services except OSDs. I am using dual 40 gig ethernet and I was seeing about 1.8 GB/s on rebalancing, but now I am seeing less than 100 MB/s. CephFS has dropped to an embarrassing 61.5 MB/s with fio.
Jobs: 1 (f=1)...
When I go to my dashboard I see two virt01 entries under Ceph Monitors, Managers and Meta Data Servers. I would like to get rid of the one with the ?, but I am not sure how this window is populated.
My ceph.conf looks normal as far as I can tell:
[global]
auth_client_required = none...