stuck when installing vGPU

ltm

New Member
Jul 10, 2024
Hi,
I am new to PVE and have recently run into some problems that confuse me.

My server info:
Code:
Intel X99 motherboard
2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
2 TB NVMe
4 x NVIDIA P40 GPU

According to the PVE documentation, I use the following configuration:
pve-manager | kernel         | vGPU Software Branch | NVIDIA Host drivers
7.4-3       | 5.15.107-2-pve | 15.2                 | 525.105.14

vGPU 15.2 for KVM supports the P40.
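
One thing that may be worth checking: the configuration above lists kernel 5.15.107-2-pve, while the "Tainted:" lines in the log further down name 6.5.11-8-pve. The following is a minimal Python sketch, not a definitive diagnostic, that pulls the kernel release out of such a line and compares it with the expected one. The helper names `kernel_from_tainted_line` and `matches_expected` are made up for illustration, and the regular expression assumes the line layout seen in this post.

```python
import re

# Kernel release listed in the configuration table of this post.
EXPECTED_KERNEL = "5.15.107-2-pve"

# Matches the release field of a hung-task "Tainted:" line, e.g.
# "... kernel:       Tainted: P           OE      6.5.11-8-pve #1"
# (assumption: the release is the last token before the "#<build>" field).
TAINTED_RE = re.compile(r"Tainted:\s+.*?\s(\d+\.\d+\.\d+\S*)\s+#\d+")


def kernel_from_tainted_line(line):
    """Return the kernel release named in a 'Tainted:' log line, or None."""
    match = TAINTED_RE.search(line)
    return match.group(1) if match else None


def matches_expected(line, expected=EXPECTED_KERNEL):
    """True if the kernel release in the log line equals the expected one."""
    return kernel_from_tainted_line(line) == expected


# Line copied from the log in this post.
sample = "Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1"

print(kernel_from_tainted_line(sample))  # 6.5.11-8-pve
print(matches_expected(sample))          # False
```

If the two differ on the live host as well, the host driver may have been built for or tested against a different kernel than the one that is actually booted.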

After installing the vGPU driver on PVE, I ran into three big problems:
  1. ssh and the nvidia-smi command hang for a long time, and can even take down the host
  2. The log contains blocked-task messages that I do not understand:
    Code:
    Jul 09 18:36:49 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 120 seconds.
    Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:36:49 pve kernel: task:nvidia-vgpud    state:D stack:0     pid:1123  ppid:1      flags:0x00000002
    Jul 09 18:36:49 pve kernel: Call Trace:
    Jul 09 18:36:49 pve kernel:  <TASK>
    Jul 09 18:36:49 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:36:49 pve kernel:  ? __kmem_cache_alloc_node+0x1aa/0x360
    Jul 09 18:36:49 pve kernel:  ? os_alloc_mem+0xdd/0x100 [nvidia]
    Jul 09 18:36:49 pve kernel:  schedule+0x63/0x110
    Jul 09 18:36:49 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:36:49 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:36:49 pve kernel:  __down+0x1d/0x30
    Jul 09 18:36:49 pve kernel:  down+0x54/0x80
    Jul 09 18:36:49 pve kernel:  nvidia_frontend_open+0x29/0xb0 [nvidia]
    Jul 09 18:36:49 pve kernel:  chrdev_open+0xcb/0x250
    Jul 09 18:36:49 pve kernel:  ? fsnotify_perm.part.0+0x83/0x200
    Jul 09 18:36:49 pve kernel:  ? __pfx_chrdev_open+0x10/0x10
    Jul 09 18:36:49 pve kernel:  do_dentry_open+0x220/0x530
    Jul 09 18:36:49 pve kernel:  vfs_open+0x33/0x50
    Jul 09 18:36:49 pve kernel:  path_openat+0xb1c/0x1180
    Jul 09 18:36:49 pve kernel:  ? chacha_block_generic+0x6d/0xc0
    Jul 09 18:36:49 pve kernel:  ? _get_random_bytes+0xcf/0x1b0
    Jul 09 18:36:49 pve kernel:  do_filp_open+0xaf/0x170
    Jul 09 18:36:49 pve kernel:  do_sys_openat2+0xb3/0xe0
    Jul 09 18:36:49 pve kernel:  __x64_sys_openat+0x6c/0xa0
    Jul 09 18:36:49 pve kernel:  do_syscall_64+0x5b/0x90
    Jul 09 18:36:49 pve kernel:  ? do_symlinkat+0xd6/0x150
    Jul 09 18:36:49 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:36:49 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:36:49 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:36:49 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:36:49 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:36:49 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:36:49 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:36:49 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:36:49 pve kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Jul 09 18:36:49 pve kernel: RIP: 0033:0x7f0423bedf01
    Jul 09 18:36:49 pve kernel: RSP: 002b:00007fff7d7ef520 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
    Jul 09 18:36:49 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f0423bedf01
    Jul 09 18:36:49 pve kernel: RDX: 0000000000080002 RSI: 00007fff7d7ef5b0 RDI: 00000000ffffff9c
    Jul 09 18:36:49 pve kernel: RBP: 00007fff7d7ef5b0 R08: 0000000000000000 R09: 0000000000000064
    Jul 09 18:36:49 pve kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff7d7ef660
    Jul 09 18:36:49 pve kernel: R13: 00000000c1d00008 R14: 00000000d0040802 R15: 00000000c1d00008
    Jul 09 18:36:49 pve kernel:  </TASK>
    Jul 09 18:36:49 pve kernel: INFO: task nv_queue:1230 blocked for more than 120 seconds.
    Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:36:49 pve kernel: task:nv_queue        state:D stack:0     pid:1230  ppid:2      flags:0x00004000
    Jul 09 18:36:49 pve kernel: Call Trace:
    Jul 09 18:36:49 pve kernel:  <TASK>
    Jul 09 18:36:49 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:36:49 pve kernel:  ? _nv010522rm+0xd0/0x250 [nvidia]
    Jul 09 18:36:49 pve kernel:  schedule+0x63/0x110
    Jul 09 18:36:49 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:36:49 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:36:49 pve kernel:  ? finish_task_switch.isra.0+0x85/0x2c0
    Jul 09 18:36:49 pve kernel:  __down+0x1d/0x30
    Jul 09 18:36:49 pve kernel:  down+0x54/0x80
    Jul 09 18:36:49 pve kernel:  os_acquire_mutex+0x3c/0x70 [nvidia]
    Jul 09 18:36:49 pve kernel:  _nv042338rm+0x10/0x40 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _nv013205rm+0x64d/0x7d0 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _nv043295rm+0x122/0x180 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _nv048990rm+0xeb/0x260 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? rm_execute_work_item+0x5e/0x130 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? os_execute_work_item+0x6c/0x90 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _main_loop+0x82/0x140 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? kthread+0xf2/0x120
    Jul 09 18:36:49 pve kernel:  ? __pfx_kthread+0x10/0x10
    Jul 09 18:36:49 pve kernel:  ? ret_from_fork+0x47/0x70
    Jul 09 18:36:49 pve kernel:  ? __pfx_kthread+0x10/0x10
    Jul 09 18:36:49 pve kernel:  ? ret_from_fork_asm+0x1b/0x30
    Jul 09 18:36:49 pve kernel:  </TASK>
    Jul 09 18:36:49 pve kernel: INFO: task (agetty):1251 blocked for more than 120 seconds.
    Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:36:49 pve kernel: task:(agetty)        state:D stack:0     pid:1251  ppid:1      flags:0x00000002
    Jul 09 18:36:49 pve kernel: Call Trace:
    Jul 09 18:36:49 pve kernel:  <TASK>
    Jul 09 18:36:49 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:36:49 pve kernel:  ? hrtimer_try_to_cancel+0x87/0x120
    Jul 09 18:36:49 pve kernel:  ? schedule_hrtimeout_range_clock+0xc4/0x130
    Jul 09 18:36:49 pve kernel:  schedule+0x63/0x110
    Jul 09 18:36:49 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:36:49 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:36:49 pve kernel:  __down+0x1d/0x30
    Jul 09 18:36:49 pve kernel:  down+0x54/0x80
    Jul 09 18:36:49 pve kernel:  console_lock+0x25/0x80
    Jul 09 18:36:49 pve kernel:  con_install+0x21/0x130
    Jul 09 18:36:49 pve kernel:  tty_init_dev.part.0+0x4e/0x280
    Jul 09 18:36:49 pve kernel:  tty_open+0x48d/0x6f0
    Jul 09 18:36:49 pve kernel:  chrdev_open+0xcb/0x250
    Jul 09 18:36:49 pve kernel:  ? fsnotify_perm.part.0+0x83/0x200
    Jul 09 18:36:49 pve kernel:  ? __pfx_chrdev_open+0x10/0x10
    Jul 09 18:36:49 pve kernel:  do_dentry_open+0x220/0x530
    Jul 09 18:36:49 pve kernel:  vfs_open+0x33/0x50
    Jul 09 18:36:49 pve kernel:  path_openat+0xb1c/0x1180
    Jul 09 18:36:49 pve kernel:  do_filp_open+0xaf/0x170
    Jul 09 18:36:49 pve kernel:  do_sys_openat2+0xb3/0xe0
    Jul 09 18:36:49 pve kernel:  __x64_sys_openat+0x6c/0xa0
    Jul 09 18:36:49 pve kernel:  do_syscall_64+0x5b/0x90
    Jul 09 18:36:49 pve kernel:  ? irqentry_exit_to_user_mode+0x17/0x20
    Jul 09 18:36:49 pve kernel:  ? irqentry_exit+0x43/0x50
    Jul 09 18:36:49 pve kernel:  ? exc_page_fault+0x94/0x1b0
    Jul 09 18:36:49 pve kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Jul 09 18:36:49 pve kernel: RIP: 0033:0x7f25a2116f80
    Jul 09 18:36:49 pve kernel: RSP: 002b:00007ffdb0e85060 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
    Jul 09 18:36:49 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080902 RCX: 00007f25a2116f80
    Jul 09 18:36:49 pve kernel: RDX: 0000000000080902 RSI: 0000555822eee780 RDI: 00000000ffffff9c
    Jul 09 18:36:49 pve kernel: RBP: 0000555822eee780 R08: 0000000000000000 R09: 00007ffdb0e85150
    Jul 09 18:36:49 pve kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000080902
    Jul 09 18:36:49 pve kernel: R13: 0000555822eee780 R14: 00007ffdb0e85680 R15: 0000555822ee9510
    Jul 09 18:36:49 pve kernel:  </TASK>
    Jul 09 18:37:13 pve pvedaemon[1484]: <root@pam> successful auth for user 'root@pam'
    Jul 09 18:38:50 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 241 seconds.
    Jul 09 18:38:50 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:38:50 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:38:50 pve kernel: task:nvidia-vgpud    state:D stack:0     pid:1123  ppid:1      flags:0x00000002
    Jul 09 18:38:50 pve kernel: Call Trace:
    Jul 09 18:38:50 pve kernel:  <TASK>
    Jul 09 18:38:50 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:38:50 pve kernel:  ? __kmem_cache_alloc_node+0x1aa/0x360
    Jul 09 18:38:50 pve kernel:  ? os_alloc_mem+0xdd/0x100 [nvidia]
    Jul 09 18:38:50 pve kernel:  schedule+0x63/0x110
    Jul 09 18:38:50 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:38:50 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:38:50 pve kernel:  __down+0x1d/0x30
    Jul 09 18:38:50 pve kernel:  down+0x54/0x80
    Jul 09 18:38:50 pve kernel:  nvidia_frontend_open+0x29/0xb0 [nvidia]
    Jul 09 18:38:50 pve kernel:  chrdev_open+0xcb/0x250
    Jul 09 18:38:50 pve kernel:  ? fsnotify_perm.part.0+0x83/0x200
    Jul 09 18:38:50 pve kernel:  ? __pfx_chrdev_open+0x10/0x10
    Jul 09 18:38:50 pve kernel:  do_dentry_open+0x220/0x530
    Jul 09 18:38:50 pve kernel:  vfs_open+0x33/0x50
    Jul 09 18:38:50 pve kernel:  path_openat+0xb1c/0x1180
    Jul 09 18:38:50 pve kernel:  ? chacha_block_generic+0x6d/0xc0
    Jul 09 18:38:50 pve kernel:  ? _get_random_bytes+0xcf/0x1b0
    Jul 09 18:38:50 pve kernel:  do_filp_open+0xaf/0x170
    Jul 09 18:38:50 pve kernel:  do_sys_openat2+0xb3/0xe0
    Jul 09 18:38:50 pve kernel:  __x64_sys_openat+0x6c/0xa0
    Jul 09 18:38:50 pve kernel:  do_syscall_64+0x5b/0x90
    Jul 09 18:38:50 pve kernel:  ? do_symlinkat+0xd6/0x150
    Jul 09 18:38:50 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:38:50 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:38:50 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:38:50 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:38:50 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:38:50 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:38:50 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:38:50 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:38:50 pve kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Jul 09 18:38:50 pve kernel: RIP: 0033:0x7f0423bedf01
    Jul 09 18:38:50 pve kernel: RSP: 002b:00007fff7d7ef520 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
    Jul 09 18:38:50 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f0423bedf01
    Jul 09 18:38:50 pve kernel: RDX: 0000000000080002 RSI: 00007fff7d7ef5b0 RDI: 00000000ffffff9c
    Jul 09 18:38:50 pve kernel: RBP: 00007fff7d7ef5b0 R08: 0000000000000000 R09: 0000000000000064
    Jul 09 18:38:50 pve kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff7d7ef660
    Jul 09 18:38:50 pve kernel: R13: 00000000c1d00008 R14: 00000000d0040802 R15: 00000000c1d00008
    Jul 09 18:38:50 pve kernel:  </TASK>
    Jul 09 18:38:50 pve kernel: INFO: task nv_queue:1230 blocked for more than 241 seconds.
    Jul 09 18:38:50 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:38:50 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:38:50 pve kernel: task:nv_queue        state:D stack:0     pid:1230  ppid:2      flags:0x00004000
    Jul 09 18:38:50 pve kernel: Call Trace:
    Jul 09 18:38:50 pve kernel:  <TASK>
    Jul 09 18:38:50 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:38:50 pve kernel:  ? _nv010522rm+0xd0/0x250 [nvidia]
    Jul 09 18:38:50 pve kernel:  schedule+0x63/0x110
    Jul 09 18:38:50 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:38:50 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:38:50 pve kernel:  ? finish_task_switch.isra.0+0x85/0x2c0
    Jul 09 18:38:50 pve kernel:  __down+0x1d/0x30
    Jul 09 18:38:50 pve kernel:  down+0x54/0x80
    Jul 09 18:38:50 pve kernel:  os_acquire_mutex+0x3c/0x70 [nvidia]
    Jul 09 18:38:50 pve kernel:  _nv042338rm+0x10/0x40 [nvidia]
    Jul 09 18:38:50 pve kernel:  ? _nv013205rm+0x64d/0x7d0 [nvidia]
    Jul 09 18:38:50 pve kernel:  ? _nv043295rm+0x122/0x180 [nvidia]
    
    
    ................................
  3. The nvidia-vgpu-mgr service takes more than an hour to start:
    Code:
    Jul 09 18:32:52 pve systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
    .....
    
    Jul 09 19:43:19 pve nvidia-vgpu-mgr[1118]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
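
To put numbers on problems 2 and 3, here is a small Python sketch that summarizes which tasks the kernel reported as blocked and measures the gap between the two nvidia-vgpu-mgr log lines. It is only an illustration under stated assumptions: the function names are hypothetical, the regular expression assumes the "INFO: task NAME:PID blocked for more than N seconds" wording seen in the log above, month abbreviations are parsed with an English/C locale, and since the syslog timestamps carry no year, 2024 is assumed.

```python
import re
from datetime import datetime

# Hung-task header as printed by the kernel, e.g.
# "INFO: task nvidia-vgpud:1123 blocked for more than 120 seconds."
BLOCKED_RE = re.compile(r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds")


def longest_block_per_task(log_lines):
    """Map (task name, pid) to the largest reported 'blocked for' value."""
    longest = {}
    for line in log_lines:
        match = BLOCKED_RE.search(line)
        if match:
            key = (match.group(1), int(match.group(2)))
            longest[key] = max(longest.get(key, 0), int(match.group(3)))
    return longest


def elapsed_seconds(start_stamp, end_stamp, year=2024):
    """Seconds between two syslog-style stamps such as 'Jul 09 18:32:52'.

    The stamps have no year, so one is assumed; both must fall in that year.
    """
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {start_stamp}", fmt)
    end = datetime.strptime(f"{year} {end_stamp}", fmt)
    return int((end - start).total_seconds())


# Header lines copied from the log in this post.
sample_lines = [
    "Jul 09 18:36:49 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 120 seconds.",
    "Jul 09 18:36:49 pve kernel: INFO: task nv_queue:1230 blocked for more than 120 seconds.",
    "Jul 09 18:36:49 pve kernel: INFO: task (agetty):1251 blocked for more than 120 seconds.",
    "Jul 09 18:38:50 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 241 seconds.",
    "Jul 09 18:38:50 pve kernel: INFO: task nv_queue:1230 blocked for more than 241 seconds.",
]

for (name, pid), seconds in longest_block_per_task(sample_lines).items():
    print(f"{name} (pid {pid}): blocked for more than {seconds} s")

# Time from "Started nvidia-vgpu-mgr.service" to "daemon started".
delay = elapsed_seconds("Jul 09 18:32:52", "Jul 09 19:43:19")
print(delay)  # 4227 seconds, i.e. 1 h 10 min 27 s
```

In the excerpt shown here, nvidia-vgpud and nv_queue are both waiting on locks inside the nvidia module, and agetty is waiting on the console lock, so the summary mainly helps to see at a glance which processes are stuck and for how long.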

I need help. Does anyone know why the vGPU installation gets stuck?
Thanks
 
