stuck when installing vGPU

ltm

New Member
Jul 10, 2024
Hi,
I am new to PVE and have recently run into some problems that confuse me.

My server info:
Code:
Intel X99 motherboard
2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
2 TB NVMe
4 x NVIDIA P40 GPU

According to the PVE documentation, I use the following configuration:
pve-manager | kernel         | vGPU Software Branch | NVIDIA Host drivers
7.4-3       | 5.15.107-2-pve | 15.2                 | 525.105.14

and vGPU 15.2 for KVM supports the P40.
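
The kernel named in the table can be compared with the one the host reports in its own log lines. This is a minimal sketch in Python, assuming the "Tainted:" line format from the kernel log quoted under problem 2 of this post. `kernel_from_taint_line` is a hypothetical helper name, not part of any Proxmox or NVIDIA tool.

```python
import re

# Kernel the configuration table in this post calls for.
EXPECTED_KERNEL = "5.15.107-2-pve"


def kernel_from_taint_line(line):
    """Return the kernel release found after 'Tainted:' in a log line, or None."""
    match = re.search(r"Tainted:.*?(\d+\.\d+\.\d+-\d+-pve)", line)
    return match.group(1) if match else None


# Line copied from the kernel log quoted in this post.
line = "Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1"

running = kernel_from_taint_line(line)
verdict = "matches" if running == EXPECTED_KERNEL else "differs from"
print(running, verdict, EXPECTED_KERNEL)
```

For the line quoted here the two versions differ, which may be worth ruling out before looking further.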

After installing the vGPU driver on PVE, I ran into three big problems:
  1. SSH sessions and the nvidia-smi command hang for a long time, and this can even take down the host
  2. the kernel log shows blocked-task messages that I cannot explain:
    Code:
    Jul 09 18:36:49 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 120 seconds.
    Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:36:49 pve kernel: task:nvidia-vgpud    state:D stack:0     pid:1123  ppid:1      flags:0x00000002
    Jul 09 18:36:49 pve kernel: Call Trace:
    Jul 09 18:36:49 pve kernel:  <TASK>
    Jul 09 18:36:49 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:36:49 pve kernel:  ? __kmem_cache_alloc_node+0x1aa/0x360
    Jul 09 18:36:49 pve kernel:  ? os_alloc_mem+0xdd/0x100 [nvidia]
    Jul 09 18:36:49 pve kernel:  schedule+0x63/0x110
    Jul 09 18:36:49 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:36:49 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:36:49 pve kernel:  __down+0x1d/0x30
    Jul 09 18:36:49 pve kernel:  down+0x54/0x80
    Jul 09 18:36:49 pve kernel:  nvidia_frontend_open+0x29/0xb0 [nvidia]
    Jul 09 18:36:49 pve kernel:  chrdev_open+0xcb/0x250
    Jul 09 18:36:49 pve kernel:  ? fsnotify_perm.part.0+0x83/0x200
    Jul 09 18:36:49 pve kernel:  ? __pfx_chrdev_open+0x10/0x10
    Jul 09 18:36:49 pve kernel:  do_dentry_open+0x220/0x530
    Jul 09 18:36:49 pve kernel:  vfs_open+0x33/0x50
    Jul 09 18:36:49 pve kernel:  path_openat+0xb1c/0x1180
    Jul 09 18:36:49 pve kernel:  ? chacha_block_generic+0x6d/0xc0
    Jul 09 18:36:49 pve kernel:  ? _get_random_bytes+0xcf/0x1b0
    Jul 09 18:36:49 pve kernel:  do_filp_open+0xaf/0x170
    Jul 09 18:36:49 pve kernel:  do_sys_openat2+0xb3/0xe0
    Jul 09 18:36:49 pve kernel:  __x64_sys_openat+0x6c/0xa0
    Jul 09 18:36:49 pve kernel:  do_syscall_64+0x5b/0x90
    Jul 09 18:36:49 pve kernel:  ? do_symlinkat+0xd6/0x150
    Jul 09 18:36:49 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:36:49 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:36:49 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:36:49 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:36:49 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:36:49 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:36:49 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:36:49 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:36:49 pve kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Jul 09 18:36:49 pve kernel: RIP: 0033:0x7f0423bedf01
    Jul 09 18:36:49 pve kernel: RSP: 002b:00007fff7d7ef520 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
    Jul 09 18:36:49 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f0423bedf01
    Jul 09 18:36:49 pve kernel: RDX: 0000000000080002 RSI: 00007fff7d7ef5b0 RDI: 00000000ffffff9c
    Jul 09 18:36:49 pve kernel: RBP: 00007fff7d7ef5b0 R08: 0000000000000000 R09: 0000000000000064
    Jul 09 18:36:49 pve kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff7d7ef660
    Jul 09 18:36:49 pve kernel: R13: 00000000c1d00008 R14: 00000000d0040802 R15: 00000000c1d00008
    Jul 09 18:36:49 pve kernel:  </TASK>
    Jul 09 18:36:49 pve kernel: INFO: task nv_queue:1230 blocked for more than 120 seconds.
    Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:36:49 pve kernel: task:nv_queue        state:D stack:0     pid:1230  ppid:2      flags:0x00004000
    Jul 09 18:36:49 pve kernel: Call Trace:
    Jul 09 18:36:49 pve kernel:  <TASK>
    Jul 09 18:36:49 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:36:49 pve kernel:  ? _nv010522rm+0xd0/0x250 [nvidia]
    Jul 09 18:36:49 pve kernel:  schedule+0x63/0x110
    Jul 09 18:36:49 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:36:49 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:36:49 pve kernel:  ? finish_task_switch.isra.0+0x85/0x2c0
    Jul 09 18:36:49 pve kernel:  __down+0x1d/0x30
    Jul 09 18:36:49 pve kernel:  down+0x54/0x80
    Jul 09 18:36:49 pve kernel:  os_acquire_mutex+0x3c/0x70 [nvidia]
    Jul 09 18:36:49 pve kernel:  _nv042338rm+0x10/0x40 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _nv013205rm+0x64d/0x7d0 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _nv043295rm+0x122/0x180 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _nv048990rm+0xeb/0x260 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? rm_execute_work_item+0x5e/0x130 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? os_execute_work_item+0x6c/0x90 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? _main_loop+0x82/0x140 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia]
    Jul 09 18:36:49 pve kernel:  ? kthread+0xf2/0x120
    Jul 09 18:36:49 pve kernel:  ? __pfx_kthread+0x10/0x10
    Jul 09 18:36:49 pve kernel:  ? ret_from_fork+0x47/0x70
    Jul 09 18:36:49 pve kernel:  ? __pfx_kthread+0x10/0x10
    Jul 09 18:36:49 pve kernel:  ? ret_from_fork_asm+0x1b/0x30
    Jul 09 18:36:49 pve kernel:  </TASK>
    Jul 09 18:36:49 pve kernel: INFO: task (agetty):1251 blocked for more than 120 seconds.
    Jul 09 18:36:49 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:36:49 pve kernel: task:(agetty)        state:D stack:0     pid:1251  ppid:1      flags:0x00000002
    Jul 09 18:36:49 pve kernel: Call Trace:
    Jul 09 18:36:49 pve kernel:  <TASK>
    Jul 09 18:36:49 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:36:49 pve kernel:  ? hrtimer_try_to_cancel+0x87/0x120
    Jul 09 18:36:49 pve kernel:  ? schedule_hrtimeout_range_clock+0xc4/0x130
    Jul 09 18:36:49 pve kernel:  schedule+0x63/0x110
    Jul 09 18:36:49 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:36:49 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:36:49 pve kernel:  __down+0x1d/0x30
    Jul 09 18:36:49 pve kernel:  down+0x54/0x80
    Jul 09 18:36:49 pve kernel:  console_lock+0x25/0x80
    Jul 09 18:36:49 pve kernel:  con_install+0x21/0x130
    Jul 09 18:36:49 pve kernel:  tty_init_dev.part.0+0x4e/0x280
    Jul 09 18:36:49 pve kernel:  tty_open+0x48d/0x6f0
    Jul 09 18:36:49 pve kernel:  chrdev_open+0xcb/0x250
    Jul 09 18:36:49 pve kernel:  ? fsnotify_perm.part.0+0x83/0x200
    Jul 09 18:36:49 pve kernel:  ? __pfx_chrdev_open+0x10/0x10
    Jul 09 18:36:49 pve kernel:  do_dentry_open+0x220/0x530
    Jul 09 18:36:49 pve kernel:  vfs_open+0x33/0x50
    Jul 09 18:36:49 pve kernel:  path_openat+0xb1c/0x1180
    Jul 09 18:36:49 pve kernel:  do_filp_open+0xaf/0x170
    Jul 09 18:36:49 pve kernel:  do_sys_openat2+0xb3/0xe0
    Jul 09 18:36:49 pve kernel:  __x64_sys_openat+0x6c/0xa0
    Jul 09 18:36:49 pve kernel:  do_syscall_64+0x5b/0x90
    Jul 09 18:36:49 pve kernel:  ? irqentry_exit_to_user_mode+0x17/0x20
    Jul 09 18:36:49 pve kernel:  ? irqentry_exit+0x43/0x50
    Jul 09 18:36:49 pve kernel:  ? exc_page_fault+0x94/0x1b0
    Jul 09 18:36:49 pve kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Jul 09 18:36:49 pve kernel: RIP: 0033:0x7f25a2116f80
    Jul 09 18:36:49 pve kernel: RSP: 002b:00007ffdb0e85060 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
    Jul 09 18:36:49 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080902 RCX: 00007f25a2116f80
    Jul 09 18:36:49 pve kernel: RDX: 0000000000080902 RSI: 0000555822eee780 RDI: 00000000ffffff9c
    Jul 09 18:36:49 pve kernel: RBP: 0000555822eee780 R08: 0000000000000000 R09: 00007ffdb0e85150
    Jul 09 18:36:49 pve kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000080902
    Jul 09 18:36:49 pve kernel: R13: 0000555822eee780 R14: 00007ffdb0e85680 R15: 0000555822ee9510
    Jul 09 18:36:49 pve kernel:  </TASK>
    Jul 09 18:37:13 pve pvedaemon[1484]: <root@pam> successful auth for user 'root@pam'
    Jul 09 18:38:50 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 241 seconds.
    Jul 09 18:38:50 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:38:50 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:38:50 pve kernel: task:nvidia-vgpud    state:D stack:0     pid:1123  ppid:1      flags:0x00000002
    Jul 09 18:38:50 pve kernel: Call Trace:
    Jul 09 18:38:50 pve kernel:  <TASK>
    Jul 09 18:38:50 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:38:50 pve kernel:  ? __kmem_cache_alloc_node+0x1aa/0x360
    Jul 09 18:38:50 pve kernel:  ? os_alloc_mem+0xdd/0x100 [nvidia]
    Jul 09 18:38:50 pve kernel:  schedule+0x63/0x110
    Jul 09 18:38:50 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:38:50 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:38:50 pve kernel:  __down+0x1d/0x30
    Jul 09 18:38:50 pve kernel:  down+0x54/0x80
    Jul 09 18:38:50 pve kernel:  nvidia_frontend_open+0x29/0xb0 [nvidia]
    Jul 09 18:38:50 pve kernel:  chrdev_open+0xcb/0x250
    Jul 09 18:38:50 pve kernel:  ? fsnotify_perm.part.0+0x83/0x200
    Jul 09 18:38:50 pve kernel:  ? __pfx_chrdev_open+0x10/0x10
    Jul 09 18:38:50 pve kernel:  do_dentry_open+0x220/0x530
    Jul 09 18:38:50 pve kernel:  vfs_open+0x33/0x50
    Jul 09 18:38:50 pve kernel:  path_openat+0xb1c/0x1180
    Jul 09 18:38:50 pve kernel:  ? chacha_block_generic+0x6d/0xc0
    Jul 09 18:38:50 pve kernel:  ? _get_random_bytes+0xcf/0x1b0
    Jul 09 18:38:50 pve kernel:  do_filp_open+0xaf/0x170
    Jul 09 18:38:50 pve kernel:  do_sys_openat2+0xb3/0xe0
    Jul 09 18:38:50 pve kernel:  __x64_sys_openat+0x6c/0xa0
    Jul 09 18:38:50 pve kernel:  do_syscall_64+0x5b/0x90
    Jul 09 18:38:50 pve kernel:  ? do_symlinkat+0xd6/0x150
    Jul 09 18:38:50 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:38:50 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:38:50 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:38:50 pve kernel:  ? exit_to_user_mode_prepare+0x39/0x190
    Jul 09 18:38:50 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:38:50 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:38:50 pve kernel:  ? syscall_exit_to_user_mode+0x37/0x60
    Jul 09 18:38:50 pve kernel:  ? do_syscall_64+0x67/0x90
    Jul 09 18:38:50 pve kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Jul 09 18:38:50 pve kernel: RIP: 0033:0x7f0423bedf01
    Jul 09 18:38:50 pve kernel: RSP: 002b:00007fff7d7ef520 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
    Jul 09 18:38:50 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f0423bedf01
    Jul 09 18:38:50 pve kernel: RDX: 0000000000080002 RSI: 00007fff7d7ef5b0 RDI: 00000000ffffff9c
    Jul 09 18:38:50 pve kernel: RBP: 00007fff7d7ef5b0 R08: 0000000000000000 R09: 0000000000000064
    Jul 09 18:38:50 pve kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff7d7ef660
    Jul 09 18:38:50 pve kernel: R13: 00000000c1d00008 R14: 00000000d0040802 R15: 00000000c1d00008
    Jul 09 18:38:50 pve kernel:  </TASK>
    Jul 09 18:38:50 pve kernel: INFO: task nv_queue:1230 blocked for more than 241 seconds.
    Jul 09 18:38:50 pve kernel:       Tainted: P           OE      6.5.11-8-pve #1
    Jul 09 18:38:50 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jul 09 18:38:50 pve kernel: task:nv_queue        state:D stack:0     pid:1230  ppid:2      flags:0x00004000
    Jul 09 18:38:50 pve kernel: Call Trace:
    Jul 09 18:38:50 pve kernel:  <TASK>
    Jul 09 18:38:50 pve kernel:  __schedule+0x3fd/0x1450
    Jul 09 18:38:50 pve kernel:  ? _nv010522rm+0xd0/0x250 [nvidia]
    Jul 09 18:38:50 pve kernel:  schedule+0x63/0x110
    Jul 09 18:38:50 pve kernel:  schedule_timeout+0x157/0x170
    Jul 09 18:38:50 pve kernel:  __down_common+0x111/0x210
    Jul 09 18:38:50 pve kernel:  ? finish_task_switch.isra.0+0x85/0x2c0
    Jul 09 18:38:50 pve kernel:  __down+0x1d/0x30
    Jul 09 18:38:50 pve kernel:  down+0x54/0x80
    Jul 09 18:38:50 pve kernel:  os_acquire_mutex+0x3c/0x70 [nvidia]
    Jul 09 18:38:50 pve kernel:  _nv042338rm+0x10/0x40 [nvidia]
    Jul 09 18:38:50 pve kernel:  ? _nv013205rm+0x64d/0x7d0 [nvidia]
    Jul 09 18:38:50 pve kernel:  ? _nv043295rm+0x122/0x180 [nvidia]
    
    
    ................................
  3. the nvidia-vgpu-mgr service takes more than one hour to start
    Code:
    Jul 09 18:32:52 pve systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
    .....
    
    Jul 09 19:43:19 pve nvidia-vgpu-mgr[1118]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
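
The startup delay in problem 3 can be read directly off the two journal lines. This is a minimal sketch in Python, assuming the "Mon DD HH:MM:SS" timestamp prefix shown above and an English locale for month names. The year is not in the log, so 2024 is assumed from the date of this post, and `journal_time` is a hypothetical helper name.

```python
from datetime import datetime


def journal_time(line, year=2024):
    """Parse the 'Mon DD HH:MM:SS' prefix of a journal line (year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Both lines are copied from the log in this post.
started = "Jul 09 18:32:52 pve systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon."
ready = "Jul 09 19:43:19 pve nvidia-vgpu-mgr[1118]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started"

delay = journal_time(ready) - journal_time(started)
print(delay)
```

The result is the gap between systemd reporting the unit as started and the daemon logging that it is up, which is the "more than one hour" described above.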

I need help. Does anyone know what is causing the vGPU installation to hang?
Thanks