I am trying to set up a Slurm cluster using 3 nodes with the following specs:
- OS: Proxmox VE 8.1.4 x86_64
- Kernel: 6.5.13-1-pve
- CPU: AMD EPYC 7662
- GPU: NVIDIA GeForce RTX 4070 Ti
- Memory: 128 GB
The packages on the nodes are mostly identical, except for the packages added to node #1 (hostname: server1) after installing a few things. This node is the only one on which the /dev/nvidia0 file exists.
Packages I installed on server1:
- conda
- the GNOME desktop environment (I never got it working)
- a few others I don't remember, though I really doubt they would interfere with the NVIDIA drivers
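To double-check that the package sets really are almost identical, this is roughly how I would diff the installed packages between two nodes (just a sketch, assuming the same passwordless root SSH used elsewhere in this post):
Code:
# Dump the installed package names from each node (Debian/Proxmox, so dpkg)
ssh server1 "dpkg -l | awk '/^ii/ {print \$2}'" | sort > packages.server1
ssh server2 "dpkg -l | awk '/^ii/ {print \$2}'" | sort > packages.server2
# Lines starting with '<' exist only on server1, '>' only on server2
diff packages.server1 packages.server2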
For Slurm to make use of the GPUs, they need to be configured as GRES (generic resources).
The /etc/slurm/gres.conf file used to achieve that needs the path to the /dev/nvidia0 device node (which is apparently what it is called, according to ChatGPT).
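For reference, this is a minimal sketch of the kind of gres.conf entry I have in mind (assuming one RTX 4070 Ti per node and that the device node exists on every node):
Code:
# /etc/slurm/gres.conf (sketch) - one GPU per node, referenced by its device node
NodeName=server[1-3] Name=gpu File=/dev/nvidia0
# slurm.conf would then also need GresTypes=gpu and Gres=gpu:1 on the node definitions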
This file, however, is missing on 2 of the 3 nodes:
Code:
root@server1:~# ls /dev/nvidia0 ; ssh server2 ls /dev/nvidia0 ; ssh server3 ls /dev/nvidia0
/dev/nvidia0
ls: cannot access '/dev/nvidia0': No such file or directory
ls: cannot access '/dev/nvidia0': No such file or directory
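If it helps, I can also gather the full list of NVIDIA device nodes on every node with something like the following (a sketch, using the same SSH access as above) and post the output:
Code:
# List every /dev/nvidia* node on each server; 2>&1 keeps the "No such file" errors visible
for h in server1 server2 server3; do
    echo "== $h =="
    ssh "$h" 'ls -l /dev/nvidia* 2>&1'
done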
On server2, the file did appear once: after reinstalling CUDA, it showed up following a few hours of uptime with absolutely no usage, but this has not happened again. Server3 never showed this behaviour; even after reinstalling CUDA, the file has not appeared at all.
This started after months of the file existing and everything behaving normally. Just before the files disappeared, all three nodes were unpowered for a couple of weeks. The period during which everything was fine did include a few hard shutdowns and simultaneous power cycles of all the nodes.
What might be causing this issue? I have included the outputs of `nvidia-smi` and `dmesg | grep -i nvidia` in this post; please let me know if there is any other information that might be relevant. Is there anything I can do via the Proxmox web interface to debug/fix this?
These are the outputs of `nvidia-smi`:
On server1:
Code:
root@server1:~# nvidia-smi
Tue Dec 24 16:18:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:C2:00.0 Off | N/A |
| 0% 29C P8 5W / 285W | 20MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4222 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 4975 G /usr/bin/gnome-shell 4MiB |
+-----------------------------------------------------------------------------------------+
On server2:
Code:
root@server2:~# nvidia-smi
Tue Dec 24 16:23:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:C2:00.0 Off | N/A |
| 30% 33C P0 N/A / 285W | 0MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
On server3:
Code:
root@server3:~# nvidia-smi
Tue Dec 24 16:25:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:C2:00.0 Off | N/A |
| 30% 29C P0 31W / 285W | 0MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
These are the outputs of `dmesg | grep -i nvidia`:
On server1:
Code:
root@server1:~# dmesg | grep -i nvidia
[ 16.573635] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 16.743891] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[ 16.746998] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 16.793640] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[ 16.847931] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[ 16.874845] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[ 16.874850] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[ 17.447880] audit: type=1400 audit(1734333058.552:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=3129 comm="apparmor_parser"
[ 17.447890] audit: type=1400 audit(1734333058.552:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=3129 comm="apparmor_parser"
[ 19.914023] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input4
[ 19.914161] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input5
[ 19.914249] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input6
[ 19.914341] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input7
[ 490.145165] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 490.205086] nvidia-uvm: Loaded the UVM driver, major device number 509.
On server2:
Code:
root@server2:~# dmesg | grep -i nvidia
[ 15.802056] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 15.946456] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 15.949422] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 15.999494] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[ 16.082782] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[ 16.133452] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[ 16.133455] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[ 16.684792] audit: type=1400 audit(1734333052.814:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=3022 comm="apparmor_parser"
[ 16.684804] audit: type=1400 audit(1734333052.814:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=3022 comm="apparmor_parser"
[ 19.217360] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input4
[ 19.217465] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input5
[ 19.217613] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input6
[ 19.217689] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:c0/0000:c0:01.3/0000:c2:00.1/sound/card0/input7
[614766.966974] [drm] [nvidia-drm] [GPU ID 0x0000c200] Unloading driver
[614767.007398] nvidia-modeset: Unloading
[614767.055085] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
[614792.941974] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[614792.945073] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[614792.990426] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[614793.118228] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[614793.174935] nvidia-uvm: Loaded the UVM driver, major device number 508.
[614793.193257] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[614793.207944] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[614793.207949] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[614793.215255] [drm] [nvidia-drm] [GPU ID 0x0000c200] Unloading driver
[614793.235527] nvidia-modeset: Unloading
[614793.269550] nvidia-uvm: Unloaded the UVM driver.
[614793.308722] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
[614830.614019] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[614830.617285] nvidia 0000:c2:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[614830.667271] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[614830.682131] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.14 Thu Feb 22 01:25:25 UTC 2024
[614830.694063] [drm] [nvidia-drm] [GPU ID 0x0000c200] Loading driver
[614830.694066] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c2:00.0 on minor 1
[702154.162815] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[702154.219888] nvidia-uvm: Loaded the UVM driver, major device number 508.
On server3:
Not included due to the character limit on posts set by this forum; it is very similar to the output of server2.