I am trying to set up a Slurm cluster using 3 nodes with the following specs:
- OS: Proxmox VE 8.1.4 x86_64
- Kernel: 6.5.13-1-pve
- CPU: AMD EPYC 7662
- GPU: NVIDIA GeForce RTX 4070 Ti
- Memory: 128 GB
The packages on the nodes are mostly identical, except for the packages added to node #1 (hostname: server1) after installing a few things. This node is the only one on which the /dev/nvidia0 file exists.
Packages I installed on server1:
- conda
- the GNOME desktop environment (I never got it working)
- a few others I don't remember, though I really doubt they would interfere with the NVIDIA drivers
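To double-check that the package sets really are almost identical, this is roughly how I would diff the installed packages between two nodes (just a sketch, assuming the same passwordless root SSH used elsewhere in this post):
Code:
# Dump the installed package names from each node (Debian/Proxmox, so dpkg)
ssh server1 "dpkg -l | awk '/^ii/ {print \$2}'" | sort > packages.server1
ssh server2 "dpkg -l | awk '/^ii/ {print \$2}'" | sort > packages.server2
# Lines starting with '<' exist only on server1, '>' only on server2
diff packages.server1 packages.server2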
For Slurm to make use of the GPUs, they need to be configured as GRES (generic resources).
The /etc/slurm/gres.conf file used to achieve that needs the path to the /dev/nvidia0 device node (which is apparently what it is called, according to ChatGPT).
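For reference, this is a minimal sketch of the kind of gres.conf entry I have in mind (assuming one RTX 4070 Ti per node and that the device node exists on every node):
Code:
# /etc/slurm/gres.conf (sketch) - one GPU per node, referenced by its device node
NodeName=server[1-3] Name=gpu File=/dev/nvidia0
# slurm.conf would then also need GresTypes=gpu and Gres=gpu:1 on the node definitions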
This file, however, is missing on 2 of the 3 nodes:
Code:
root@server1:~# ls /dev/nvidia0 ; ssh server2 ls /dev/nvidia0 ; ssh server3 ls /dev/nvidia0
/dev/nvidia0
ls: cannot access '/dev/nvidia0': No such file or directory
ls: cannot access '/dev/nvidia0': No such file or directory
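If it helps, I can also gather the full list of NVIDIA device nodes on every node with something like the following (a sketch, using the same SSH access as above) and post the output:
Code:
# List every /dev/nvidia* node on each server; 2>&1 keeps the "No such file" errors visible
for h in server1 server2 server3; do
    echo "== $h =="
    ssh "$h" 'ls -l /dev/nvidia* 2>&1'
done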
On server2, the file did appear once: after reinstalling CUDA, it showed up following a few hours of uptime with absolutely no usage, but this has not happened again. Server3 never showed this behaviour; even after reinstalling CUDA, the file has not appeared at all.
This started after months of the file existing and everything behaving normally. Just before the files disappeared, all three nodes were unpowered for a couple of weeks. The period during which everything was fine did include a few hard shutdowns and simultaneous power cycles of all the nodes.
What might be causing this issue? I have included the outputs of `nvidia-smi` and `dmesg | grep -i nvidia` in this post; please let me know if there is any other information that might be relevant. Is there anything I can do via the Proxmox web interface to debug/fix this?
These are the outputs of `nvidia-smi`:
On server1:
Code:
root@server1:~# nvidia-smi
Tue Dec 24 16:18:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:C2:00.0 Off | N/A |
| 0% 29C P8 5W / 285W | 20MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4222 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 4975 G /usr/bin/gnome-shell 4MiB |
+-----------------------------------------------------------------------------------------+
On server2:
Code:
root@server2:~# nvidia-smi
Tue Dec 24 16:23:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:C2:00.0 Off | N/A |
| 30% 33C P0 N/A / 285W | 0MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
On server3:
Code:
root@server3:~# nvidia-smi
Tue Dec 24 16:25:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:C2:00.0 Off | N/A |
| 30% 29C P0 31W / 285W | 0MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
These are the outputs of `dmesg | grep -i nvidia`:
On server1:
Code:
root@server1:~# dmesg | grep -i nvidia
[ 16.573635] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 16.743891] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[ 16.746998] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 16.793640] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[ 16.847931] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[ 16.874845] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[ 16.874850] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[ 17.447880] audit: type=1400 audit(1734333058.552:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=3129 comm="apparmor_parser"
[ 17.447890] audit: type=1400 audit(1734333058.552:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=3129 comm="apparmor_parser"
[ 19.914023] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input4
[ 19.914161] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input5
[ 19.914249] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input6
[ 19.914341] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input7
[ 490.145165] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 490.205086] nvidia-uvm: Loaded the UVM driver, major device number 509.
On server2:
Code:
root@server2:~# dmesg | grep -i nvidia
[ 15.802056] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 15.946456] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 15.949422] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 15.999494] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[ 16.082782] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[ 16.133452] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[ 16.133455] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[ 16.684792] audit: type=1400 audit(1734333052.814:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=3022 comm="apparmor_parser"
[ 16.684804] audit: type=1400 audit(1734333052.814:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=3022 comm="apparmor_parser"
[ 19.217360] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input4
[ 19.217465] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input5
[ 19.217613] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input6
[ 19.217689] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input7
[614766.966974] [drm] [nvidia-drm] [GPU ID 0x0000c200] Unloading driver
[614767.007398] nvidia-modeset: Unloading
[614767.055085] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
[614792.941974] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[614792.945073] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[614792.990426] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[614793.118228] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[614793.174935] nvidia-uvm: Loaded the UVM driver, major device number 508.
[614793.193257] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[614793.207944] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[614793.207949] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[614793.215255] [drm] [nvidia-drm] [GPU ID 0x0000c200] Unloading driver
[614793.235527] nvidia-modeset: Unloading
[614793.269550] nvidia-uvm: Unloaded the UVM driver.
[614793.308722] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
[614830.614019] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[614830.617285] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[614830.667271] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[614830.682131] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[614830.694063] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[614830.694066] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[702154.162815] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[702154.219888] nvidia-uvm: Loaded the UVM driver, major device number 508.
On server3:
Not included due to the character limit on posts set by this forum; it is very similar to the output of server2.