Search results

  1. Proxmox with 48 nodes

    I know in the past that the recommended max number of nodes in a cluster was 32, but is this still the case? My boxes are all dual E5-2690v4 with dual 40 Gig Ethernet. I would like to have one cluster with 48 nodes, but is that a bad idea? Should I go with two clusters of 24 nodes instead?
  2. bash sleep 10 is very inconsistent on VM

    Running Proxmox 9.0.10 with Ubuntu 24.04 guest. I have been working with voipmonitor support staff who say my issues are because of my clock. I tested a simple script that SHOULD run every 10 seconds, but does not. This is on the weekend when my load is almost zero. Any ideas? #!/bin/bash while...
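
    A minimal sketch of a drift test along those lines (the original script is truncated above, so the loop body here is an assumption):

    ```bash
    #!/bin/bash
    # Print a high-resolution timestamp every nominal 10 seconds; gaps in
    # the output that differ from 10s show whether sleep or the clock drifts.
    while true; do
        date '+%s.%N'
        sleep 10
    done
    ```

    Checking the guest's clocksource (cat /sys/devices/system/clocksource/clocksource0/current_clocksource, typically kvm-clock on KVM guests) is a common next step when the intervals wander.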
  3. How do I remove duplicate unknown monitor, manager, and MDS?

    I tried to delete and recreate the service, but the ? unknown service came back when I created a new one for virt2. Any ideas on how to clean this up?
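
    One way to tell whether the stale entry lives in Ceph itself or only in the Proxmox view is to compare the cluster maps against the on-disk config; a hedged sketch (virt2 is from the post, everything else is generic, and paths can differ by release):

    ```bash
    # What the cluster actually knows about:
    ceph mon dump        # monitors in the monmap
    ceph mgr dump        # active and standby managers
    ceph fs status       # MDS daemons per filesystem

    # Leftover config sections or enabled systemd units can make the GUI
    # show a phantom entry even when the maps above are clean:
    grep -n '^\[' /etc/pve/ceph.conf
    ls /etc/systemd/system/ceph-mon.target.wants/ 2>/dev/null
    ```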
  4. IOMMU 4 NVIDIA GPUs with NCCL

    I have a VM with four exported 3090 GPUs. The GPUs work and I can run things like gpuburn, but when I try to train my models with NCCL I run into errors. I don't have an ACS option in the BIOS of my Supermicro H12SSL (I believe it's off, so no option), but I do have IOMMU on so I can export the cards to...
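
    Two generic checks that often narrow this down (a sketch, not specific to the H12SSL; train.py stands in for the real training job):

    ```bash
    # Show whether ACS is active on the PCIe bridges; NCCL peer-to-peer
    # transfers commonly fail or crawl when ACS forces them through the root.
    sudo lspci -vvv | grep -i acsctl

    # Re-run the workload with NCCL's verbose logging, then with P2P
    # disabled, to confirm the failure is in the peer-to-peer path:
    NCCL_DEBUG=INFO python train.py
    NCCL_P2P_DISABLE=1 python train.py   # stages transfers through host memory
    ```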
  5. Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

    System Setup: Proxmox 8.0.4, Supermicro H12SSL, 1 Nvidia 4090, 3 Nvidia 3080s, Machine q35. virt101 - 3080, PCI Device 0000:02:00; virt103 - 4090, PCI Device 0000:01:00. I had virt105 with two 3080s, PCI Devices 0000:81:00 and 0000:82:00. Everything works great with this setup; I shut down 105, cloned...
  6. Dual Nvidia 3080 GPUs work on the same VM, but not if I have 2 VMs with one 3080 GPU each.

    I have 2 Nvidia 3080s, on PCI 0000:01:00.0 and 0000:02:00.0. If I put them both on one VM with x-vga=on and multifunction=on, it works: nvidia-smi shows two GPUs. However, if I start vm1 with one of the GPUs, say 0000:01:00.0, it will start fine; if I then try to start vm2 with GPU 0000:02:00.0...
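
    When two GPUs work in one VM but refuse to run in two separate VMs, the usual first check is whether both cards (and their audio functions) share an IOMMU group; the commonly used listing script:

    ```bash
    #!/bin/bash
    # List every IOMMU group and the devices in it; devices sharing a
    # group can generally only be passed through to the same VM.
    shopt -s nullglob
    for g in /sys/kernel/iommu_groups/*; do
        echo "IOMMU group ${g##*/}:"
        for d in "$g"/devices/*; do
            echo -e "\t$(lspci -nns "${d##*/}")"
        done
    done
    ```

    If both 3080s land in the same group, the usual options are a different slot, enabling ACS in firmware, or the unsupported ACS override patch.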
  7. Easy way to enable ceph authentication on a 21 server cluster without auth?

    I currently have a 21 server ceph cluster with 105 OSDs and I need to enable ceph authentication because Kubernetes can't mount a ceph volume without auth! I have looked at: https://docs.ceph.com/en/latest/rados/configuration/auth-config-ref/ Is that the only way with proxmox, or is there an...
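
    For reference, cephx is switched on via the three auth settings in the [global] section of ceph.conf; a minimal sketch (daemons need a rolling restart afterwards, and client keyrings must exist before anything reconnects):

    ```
    [global]
        auth_cluster_required = cephx
        auth_service_required = cephx
        auth_client_required = cephx
    ```

    Keys for clients such as Kubernetes can then be created with ceph auth get-or-create.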
  8. Ceph performance dropped after upgrade from 14.2.15 to 14.2.20

    I upgraded ceph on my 21 node cluster from 14.2.15 to 14.2.20 and restarted all services except OSDs. I am using dual 40 gig ethernet and I was seeing about 1.8 GB/s on rebalancing, but now I am seeing less than 100 MB/s. CephFS has dropped to an embarrassing 61.5 MB/s with fio. Jobs: 1 (f=1)...
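
    Since the OSDs were never restarted, the cluster is also running mixed 14.2.15/14.2.20 daemons, which is worth eliminating before tuning anything. For comparing before and after with the same workload, a generic fio job against the CephFS mount (path and sizes are placeholders):

    ```bash
    # Sequential write throughput on a CephFS mount; point --directory
    # at wherever cephfs is actually mounted.
    fio --name=seqwrite --directory=/mnt/pve/cephfs \
        --rw=write --bs=4M --size=4G --numjobs=1 \
        --ioengine=libaio --direct=1 --group_reporting
    ```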
  9. Getting rid of phantom node

    When I go to my dashboard I see two virt01 entries in the ceph Monitors, Managers, and Meta Data Servers panels. I would like to get rid of the one with the ?, but I am not sure how this window is populated. My ceph.conf looks normal as far as I can tell: [global] auth_client_required = none...
  10. LACP two 40 gbit/s eth or 40 gbit/s eth + 56 gbit/s infiniband

    I am upgrading a 16 node cluster that has 2 NVMe drives and 3 SATA drives used for ceph. My network cards are Mellanox MCX354A-FCBT and have 2 QSFP ports that can be configured as Infiniband or Ethernet. My question is how best should I utilize the two ports. My options are: 1) LACP into VPC...
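
    If the LACP option wins out, the bond itself is a stock ifupdown stanza in /etc/network/interfaces; a sketch assuming the two ConnectX-3 ports appear as enp1s0 and enp1s0d1 (names will differ):

    ```
    auto bond0
    iface bond0 inet manual
        bond-slaves enp1s0 enp1s0d1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4
    ```

    The layer3+4 hash policy matters for Ceph: with the default layer2 policy, all traffic between a pair of nodes hashes to one link, leaving the second 40G port idle.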
  11. pvesr.service starting every minute

    Running latest proxmox, everything looks like it's working, but I noticed that on all 16 hosts pvesr.service is starting every minute. syslog:Dec 28 18:35:00 virt0 systemd[1]: pvesr.service: Succeeded. syslog:Dec 28 18:36:00 virt0 systemd[1]: pvesr.service: Succeeded. syslog:Dec 28 18:37:00 virt0...
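
    That cadence is expected: pvesr is the storage replication runner, and it is fired by a systemd timer every minute; with no replication jobs configured it simply exits, which is what the Succeeded lines show. To confirm:

    ```bash
    systemctl list-timers pvesr.timer   # the every-minute schedule
    pvesr list                          # configured replication jobs (likely empty)
    ```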
  12. ceph nvme ssd slower than spinning disks (16 node 40 gbe ceph cluster)

    I am running the latest version of proxmox on a 16 node 40 gbe cluster. Each node has 2 Samsung 960 EVO 250GB NVMe SSDs and 3 Hitachi 2 TB 7200 RPM Ultrastar disks. I am using bluestore for all disks with two crush rules, one fast for nvme and one slow for hdd. I have tested bandwidth between all...
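
    Consumer NVMe drives without power-loss protection, the 960 EVO included, are notoriously slow at the small synchronous writes Ceph issues, and can land behind 7200 RPM disks on that workload. A common way to test a drive directly, outside Ceph (destructive: this writes to the raw device, and the path is a placeholder):

    ```bash
    # Single-threaded 4k sync writes approximate Ceph's journal pattern.
    # WARNING: destroys the contents of the target device.
    fio --name=sync-write --filename=/dev/nvme0n1 \
        --rw=write --bs=4k --sync=1 --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based --ioengine=libaio
    ```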
  13. Ceph slower after upgrading all nodes from 10 to 40 gig ethernet

    I have been upgrading my cluster node by node from 10 gig to 40 gig; now that all nodes are 40 gig I am seeing very, very slow ceph. Setup is Dual E5-2690v2 with dual 40 gig bond into a cisco 3132q. OSDs are Samsung 960 EVO 256G NVMe with 2 in each server. Total time run: 92.179118...
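
    Before digging into Ceph it is worth proving the raw network path, since a bad bond hash policy or an MTU mismatch introduced during the re-cabling can throttle or black-hole traffic; a generic sketch between two nodes:

    ```bash
    # On one node:
    iperf3 -s
    # On another:
    iperf3 -c <peer-ip> -P 4        # parallel streams to exercise the bond

    # If jumbo frames are configured, verify they survive end to end:
    ping -M do -s 8972 <peer-ip>    # 9000-byte MTU minus 28 bytes of headers
    ```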
  14. Storage CephFS and multiple file systems (passing options)

    I am using Proxmox 5.4 with CephFS and multiple file systems. One filesystem is called cephfs and it's on NVMe and the other is cephfs_slow and it's on standard SATA. I can manually mount each file system with: mount -t ceph -o mds_namespace=cephfs virt0,virt4,virt8,virt12:/ /foo mount -t ceph...
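
    For a persistent mount, the same mds_namespace option works from /etc/fstab with the kernel client; a sketch in which the mount point, user name, and secret file are assumptions:

    ```
    virt0,virt4,virt8,virt12:/  /mnt/cephfs_slow  ceph  name=admin,secretfile=/etc/ceph/admin.secret,mds_namespace=cephfs_slow,_netdev  0  0
    ```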
  15. apt-get update && apt-get dist-upgrade (failed)

    Setting up pve-cluster (5.0-31) ... Job for pve-ha-lrm.service failed because the control process exited with error code. See "systemctl status pve-ha-lrm.service" and "journalctl -xe" for details. dpkg: error processing package pve-cluster (--configure): subprocess installed post-installation...
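
    The usual way out of a half-finished dist-upgrade like this is to find out why the unit failed, then let dpkg finish configuring; a sketch:

    ```bash
    systemctl status pve-ha-lrm.service   # why the postinst's restart failed
    journalctl -xe                        # surrounding log context
    dpkg --configure -a                   # finish configuring unpacked packages
    apt-get -f install                    # resolve any remaining breakage
    ```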