Hi,
I am new to PVE and have recently run into some problems that confuse me.
My server info:
Code:
intel x99 motherboard
2* Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
2T NVME
4* Nvidia-P40 gpu
According to the PVE docs, I use the following configuration:
pve-manager | kernel         | vGPU Software Branch | NVIDIA Host drivers |
7.4-3       | 5.15.107-2-pve | 15.2                 | 525.105.14          |
and vGPU 15.2 for KVM supports the P40.
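A quick way to double-check that the host is really booted into the kernel listed in the table is sketched below. It may be worth running, because the hung-task log further down in this post reports `6.5.11-8-pve` rather than `5.15.107-2-pve`. The constant `EXPECTED_KERNEL` and the helper `kernel_matches` are names made up for this illustration; only `platform.release()` is a standard-library call.

```python
import platform

# Kernel release copied from the configuration table in this post.
EXPECTED_KERNEL = "5.15.107-2-pve"


def kernel_matches(expected: str, running: str) -> bool:
    """Return True when the running kernel release equals the expected one."""
    return expected.strip() == running.strip()


if __name__ == "__main__":
    # platform.release() returns the same string that `uname -r` prints.
    running = platform.release()
    if kernel_matches(EXPECTED_KERNEL, running):
        print(f"OK: running kernel {running} matches the table")
    else:
        print(f"Mismatch: table lists {EXPECTED_KERNEL}, host runs {running}")
```

This only compares strings; it makes no claim about which kernel and driver combinations are actually supported.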
After installing the vGPU driver in PVE, I ran into 3 big problems:
- ssh and the command nvidia-smi hang for a long time, and this can even kill the host
- there are blocked-task messages in the log, and I do not know why:
Code:
Jul 09 18:36:49 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 120 seconds.
Jul 09 18:36:49 pve kernel: Tainted: P OE 6.5.11-8-pve #1
Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 09 18:36:49 pve kernel: task:nvidia-vgpud state:D stack:0 pid:1123 ppid:1 flags:0x00000002
Jul 09 18:36:49 pve kernel: Call Trace:
Jul 09 18:36:49 pve kernel: <TASK>
Jul 09 18:36:49 pve kernel: __schedule+0x3fd/0x1450
Jul 09 18:36:49 pve kernel: ? __kmem_cache_alloc_node+0x1aa/0x360
Jul 09 18:36:49 pve kernel: ? os_alloc_mem+0xdd/0x100 [nvidia]
Jul 09 18:36:49 pve kernel: schedule+0x63/0x110
Jul 09 18:36:49 pve kernel: schedule_timeout+0x157/0x170
Jul 09 18:36:49 pve kernel: __down_common+0x111/0x210
Jul 09 18:36:49 pve kernel: __down+0x1d/0x30
Jul 09 18:36:49 pve kernel: down+0x54/0x80
Jul 09 18:36:49 pve kernel: nvidia_frontend_open+0x29/0xb0 [nvidia]
Jul 09 18:36:49 pve kernel: chrdev_open+0xcb/0x250
Jul 09 18:36:49 pve kernel: ? fsnotify_perm.part.0+0x83/0x200
Jul 09 18:36:49 pve kernel: ? __pfx_chrdev_open+0x10/0x10
Jul 09 18:36:49 pve kernel: do_dentry_open+0x220/0x530
Jul 09 18:36:49 pve kernel: vfs_open+0x33/0x50
Jul 09 18:36:49 pve kernel: path_openat+0xb1c/0x1180
Jul 09 18:36:49 pve kernel: ? chacha_block_generic+0x6d/0xc0
Jul 09 18:36:49 pve kernel: ? _get_random_bytes+0xcf/0x1b0
Jul 09 18:36:49 pve kernel: do_filp_open+0xaf/0x170
Jul 09 18:36:49 pve kernel: do_sys_openat2+0xb3/0xe0
Jul 09 18:36:49 pve kernel: __x64_sys_openat+0x6c/0xa0
Jul 09 18:36:49 pve kernel: do_syscall_64+0x5b/0x90
Jul 09 18:36:49 pve kernel: ? do_symlinkat+0xd6/0x150
Jul 09 18:36:49 pve kernel: ? exit_to_user_mode_prepare+0x39/0x190
Jul 09 18:36:49 pve kernel: ? syscall_exit_to_user_mode+0x37/0x60
Jul 09 18:36:49 pve kernel: ? do_syscall_64+0x67/0x90
Jul 09 18:36:49 pve kernel: ? exit_to_user_mode_prepare+0x39/0x190
Jul 09 18:36:49 pve kernel: ? syscall_exit_to_user_mode+0x37/0x60
Jul 09 18:36:49 pve kernel: ? do_syscall_64+0x67/0x90
Jul 09 18:36:49 pve kernel: ? syscall_exit_to_user_mode+0x37/0x60
Jul 09 18:36:49 pve kernel: ? do_syscall_64+0x67/0x90
Jul 09 18:36:49 pve kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Jul 09 18:36:49 pve kernel: RIP: 0033:0x7f0423bedf01
Jul 09 18:36:49 pve kernel: RSP: 002b:00007fff7d7ef520 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
Jul 09 18:36:49 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f0423bedf01
Jul 09 18:36:49 pve kernel: RDX: 0000000000080002 RSI: 00007fff7d7ef5b0 RDI: 00000000ffffff9c
Jul 09 18:36:49 pve kernel: RBP: 00007fff7d7ef5b0 R08: 0000000000000000 R09: 0000000000000064
Jul 09 18:36:49 pve kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff7d7ef660
Jul 09 18:36:49 pve kernel: R13: 00000000c1d00008 R14: 00000000d0040802 R15: 00000000c1d00008
Jul 09 18:36:49 pve kernel: </TASK>
Jul 09 18:36:49 pve kernel: INFO: task nv_queue:1230 blocked for more than 120 seconds.
Jul 09 18:36:49 pve kernel: Tainted: P OE 6.5.11-8-pve #1
Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 09 18:36:49 pve kernel: task:nv_queue state:D stack:0 pid:1230 ppid:2 flags:0x00004000
Jul 09 18:36:49 pve kernel: Call Trace:
Jul 09 18:36:49 pve kernel: <TASK>
Jul 09 18:36:49 pve kernel: __schedule+0x3fd/0x1450
Jul 09 18:36:49 pve kernel: ? _nv010522rm+0xd0/0x250 [nvidia]
Jul 09 18:36:49 pve kernel: schedule+0x63/0x110
Jul 09 18:36:49 pve kernel: schedule_timeout+0x157/0x170
Jul 09 18:36:49 pve kernel: __down_common+0x111/0x210
Jul 09 18:36:49 pve kernel: ? finish_task_switch.isra.0+0x85/0x2c0
Jul 09 18:36:49 pve kernel: __down+0x1d/0x30
Jul 09 18:36:49 pve kernel: down+0x54/0x80
Jul 09 18:36:49 pve kernel: os_acquire_mutex+0x3c/0x70 [nvidia]
Jul 09 18:36:49 pve kernel: _nv042338rm+0x10/0x40 [nvidia]
Jul 09 18:36:49 pve kernel: ? _nv013205rm+0x64d/0x7d0 [nvidia]
Jul 09 18:36:49 pve kernel: ? _nv043295rm+0x122/0x180 [nvidia]
Jul 09 18:36:49 pve kernel: ? _nv048990rm+0xeb/0x260 [nvidia]
Jul 09 18:36:49 pve kernel: ? rm_execute_work_item+0x5e/0x130 [nvidia]
Jul 09 18:36:49 pve kernel: ? os_execute_work_item+0x6c/0x90 [nvidia]
Jul 09 18:36:49 pve kernel: ? _main_loop+0x82/0x140 [nvidia]
Jul 09 18:36:49 pve kernel: ? __pfx__main_loop+0x10/0x10 [nvidia]
Jul 09 18:36:49 pve kernel: ? kthread+0xf2/0x120
Jul 09 18:36:49 pve kernel: ? __pfx_kthread+0x10/0x10
Jul 09 18:36:49 pve kernel: ? ret_from_fork+0x47/0x70
Jul 09 18:36:49 pve kernel: ? __pfx_kthread+0x10/0x10
Jul 09 18:36:49 pve kernel: ? ret_from_fork_asm+0x1b/0x30
Jul 09 18:36:49 pve kernel: </TASK>
Jul 09 18:36:49 pve kernel: INFO: task (agetty):1251 blocked for more than 120 seconds.
Jul 09 18:36:49 pve kernel: Tainted: P OE 6.5.11-8-pve #1
Jul 09 18:36:49 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 09 18:36:49 pve kernel: task:(agetty) state:D stack:0 pid:1251 ppid:1 flags:0x00000002
Jul 09 18:36:49 pve kernel: Call Trace:
Jul 09 18:36:49 pve kernel: <TASK>
Jul 09 18:36:49 pve kernel: __schedule+0x3fd/0x1450
Jul 09 18:36:49 pve kernel: ? hrtimer_try_to_cancel+0x87/0x120
Jul 09 18:36:49 pve kernel: ? schedule_hrtimeout_range_clock+0xc4/0x130
Jul 09 18:36:49 pve kernel: schedule+0x63/0x110
Jul 09 18:36:49 pve kernel: schedule_timeout+0x157/0x170
Jul 09 18:36:49 pve kernel: __down_common+0x111/0x210
Jul 09 18:36:49 pve kernel: __down+0x1d/0x30
Jul 09 18:36:49 pve kernel: down+0x54/0x80
Jul 09 18:36:49 pve kernel: console_lock+0x25/0x80
Jul 09 18:36:49 pve kernel: con_install+0x21/0x130
Jul 09 18:36:49 pve kernel: tty_init_dev.part.0+0x4e/0x280
Jul 09 18:36:49 pve kernel: tty_open+0x48d/0x6f0
Jul 09 18:36:49 pve kernel: chrdev_open+0xcb/0x250
Jul 09 18:36:49 pve kernel: ? fsnotify_perm.part.0+0x83/0x200
Jul 09 18:36:49 pve kernel: ? __pfx_chrdev_open+0x10/0x10
Jul 09 18:36:49 pve kernel: do_dentry_open+0x220/0x530
Jul 09 18:36:49 pve kernel: vfs_open+0x33/0x50
Jul 09 18:36:49 pve kernel: path_openat+0xb1c/0x1180
Jul 09 18:36:49 pve kernel: do_filp_open+0xaf/0x170
Jul 09 18:36:49 pve kernel: do_sys_openat2+0xb3/0xe0
Jul 09 18:36:49 pve kernel: __x64_sys_openat+0x6c/0xa0
Jul 09 18:36:49 pve kernel: do_syscall_64+0x5b/0x90
Jul 09 18:36:49 pve kernel: ? irqentry_exit_to_user_mode+0x17/0x20
Jul 09 18:36:49 pve kernel: ? irqentry_exit+0x43/0x50
Jul 09 18:36:49 pve kernel: ? exc_page_fault+0x94/0x1b0
Jul 09 18:36:49 pve kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Jul 09 18:36:49 pve kernel: RIP: 0033:0x7f25a2116f80
Jul 09 18:36:49 pve kernel: RSP: 002b:00007ffdb0e85060 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Jul 09 18:36:49 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080902 RCX: 00007f25a2116f80
Jul 09 18:36:49 pve kernel: RDX: 0000000000080902 RSI: 0000555822eee780 RDI: 00000000ffffff9c
Jul 09 18:36:49 pve kernel: RBP: 0000555822eee780 R08: 0000000000000000 R09: 00007ffdb0e85150
Jul 09 18:36:49 pve kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000080902
Jul 09 18:36:49 pve kernel: R13: 0000555822eee780 R14: 00007ffdb0e85680 R15: 0000555822ee9510
Jul 09 18:36:49 pve kernel: </TASK>
Jul 09 18:37:13 pve pvedaemon[1484]: <root@pam> successful auth for user 'root@pam'
Jul 09 18:38:50 pve kernel: INFO: task nvidia-vgpud:1123 blocked for more than 241 seconds.
Jul 09 18:38:50 pve kernel: Tainted: P OE 6.5.11-8-pve #1
Jul 09 18:38:50 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 09 18:38:50 pve kernel: task:nvidia-vgpud state:D stack:0 pid:1123 ppid:1 flags:0x00000002
Jul 09 18:38:50 pve kernel: Call Trace:
Jul 09 18:38:50 pve kernel: <TASK>
Jul 09 18:38:50 pve kernel: __schedule+0x3fd/0x1450
Jul 09 18:38:50 pve kernel: ? __kmem_cache_alloc_node+0x1aa/0x360
Jul 09 18:38:50 pve kernel: ? os_alloc_mem+0xdd/0x100 [nvidia]
Jul 09 18:38:50 pve kernel: schedule+0x63/0x110
Jul 09 18:38:50 pve kernel: schedule_timeout+0x157/0x170
Jul 09 18:38:50 pve kernel: __down_common+0x111/0x210
Jul 09 18:38:50 pve kernel: __down+0x1d/0x30
Jul 09 18:38:50 pve kernel: down+0x54/0x80
Jul 09 18:38:50 pve kernel: nvidia_frontend_open+0x29/0xb0 [nvidia]
Jul 09 18:38:50 pve kernel: chrdev_open+0xcb/0x250
Jul 09 18:38:50 pve kernel: ? fsnotify_perm.part.0+0x83/0x200
Jul 09 18:38:50 pve kernel: ? __pfx_chrdev_open+0x10/0x10
Jul 09 18:38:50 pve kernel: do_dentry_open+0x220/0x530
Jul 09 18:38:50 pve kernel: vfs_open+0x33/0x50
Jul 09 18:38:50 pve kernel: path_openat+0xb1c/0x1180
Jul 09 18:38:50 pve kernel: ? chacha_block_generic+0x6d/0xc0
Jul 09 18:38:50 pve kernel: ? _get_random_bytes+0xcf/0x1b0
Jul 09 18:38:50 pve kernel: do_filp_open+0xaf/0x170
Jul 09 18:38:50 pve kernel: do_sys_openat2+0xb3/0xe0
Jul 09 18:38:50 pve kernel: __x64_sys_openat+0x6c/0xa0
Jul 09 18:38:50 pve kernel: do_syscall_64+0x5b/0x90
Jul 09 18:38:50 pve kernel: ? do_symlinkat+0xd6/0x150
Jul 09 18:38:50 pve kernel: ? exit_to_user_mode_prepare+0x39/0x190
Jul 09 18:38:50 pve kernel: ? syscall_exit_to_user_mode+0x37/0x60
Jul 09 18:38:50 pve kernel: ? do_syscall_64+0x67/0x90
Jul 09 18:38:50 pve kernel: ? exit_to_user_mode_prepare+0x39/0x190
Jul 09 18:38:50 pve kernel: ? syscall_exit_to_user_mode+0x37/0x60
Jul 09 18:38:50 pve kernel: ? do_syscall_64+0x67/0x90
Jul 09 18:38:50 pve kernel: ? syscall_exit_to_user_mode+0x37/0x60
Jul 09 18:38:50 pve kernel: ? do_syscall_64+0x67/0x90
Jul 09 18:38:50 pve kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Jul 09 18:38:50 pve kernel: RIP: 0033:0x7f0423bedf01
Jul 09 18:38:50 pve kernel: RSP: 002b:00007fff7d7ef520 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
Jul 09 18:38:50 pve kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f0423bedf01
Jul 09 18:38:50 pve kernel: RDX: 0000000000080002 RSI: 00007fff7d7ef5b0 RDI: 00000000ffffff9c
Jul 09 18:38:50 pve kernel: RBP: 00007fff7d7ef5b0 R08: 0000000000000000 R09: 0000000000000064
Jul 09 18:38:50 pve kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff7d7ef660
Jul 09 18:38:50 pve kernel: R13: 00000000c1d00008 R14: 00000000d0040802 R15: 00000000c1d00008
Jul 09 18:38:50 pve kernel: </TASK>
Jul 09 18:38:50 pve kernel: INFO: task nv_queue:1230 blocked for more than 241 seconds.
Jul 09 18:38:50 pve kernel: Tainted: P OE 6.5.11-8-pve #1
Jul 09 18:38:50 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 09 18:38:50 pve kernel: task:nv_queue state:D stack:0 pid:1230 ppid:2 flags:0x00004000
Jul 09 18:38:50 pve kernel: Call Trace:
Jul 09 18:38:50 pve kernel: <TASK>
Jul 09 18:38:50 pve kernel: __schedule+0x3fd/0x1450
Jul 09 18:38:50 pve kernel: ? _nv010522rm+0xd0/0x250 [nvidia]
Jul 09 18:38:50 pve kernel: schedule+0x63/0x110
Jul 09 18:38:50 pve kernel: schedule_timeout+0x157/0x170
Jul 09 18:38:50 pve kernel: __down_common+0x111/0x210
Jul 09 18:38:50 pve kernel: ? finish_task_switch.isra.0+0x85/0x2c0
Jul 09 18:38:50 pve kernel: __down+0x1d/0x30
Jul 09 18:38:50 pve kernel: down+0x54/0x80
Jul 09 18:38:50 pve kernel: os_acquire_mutex+0x3c/0x70 [nvidia]
Jul 09 18:38:50 pve kernel: _nv042338rm+0x10/0x40 [nvidia]
Jul 09 18:38:50 pve kernel: ? _nv013205rm+0x64d/0x7d0 [nvidia]
Jul 09 18:38:50 pve kernel: ? _nv043295rm+0x122/0x180 [nvidia]
................................
- the nvidia-vgpu-mgr service takes more than one hour to start:
Code:
Jul 09 18:32:52 pve systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
.....
Jul 09 19:43:19 pve nvidia-vgpu-mgr[1118]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
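To make the logs above easier to summarize, here is a rough Python sketch that pulls the hung-task reports out of journal text and computes the gap between two journal timestamps. The function names `longest_blocked` and `seconds_between` are invented for this example, and the placeholder year is an assumption, since the short journal format does not print one.

```python
import re
from datetime import datetime

# Matches hung-task reports such as:
#   "INFO: task nvidia-vgpud:1123 blocked for more than 120 seconds."
HUNG_TASK = re.compile(r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds")

# Assumption: both timestamps fall in the same year; any fixed year works.
PLACEHOLDER_YEAR = "2000"


def longest_blocked(log_text: str) -> dict[str, int]:
    """Map each hung task name to the longest blocked time reported for it."""
    longest: dict[str, int] = {}
    for name, _pid, seconds in HUNG_TASK.findall(log_text):
        longest[name] = max(longest.get(name, 0), int(seconds))
    return longest


def seconds_between(start: str, end: str) -> int:
    """Seconds between two journal timestamps like 'Jul 09 18:32:52'."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{PLACEHOLDER_YEAR} {start}", fmt)
    t1 = datetime.strptime(f"{PLACEHOLDER_YEAR} {end}", fmt)
    return int((t1 - t0).total_seconds())


if __name__ == "__main__":
    # Timestamps copied from the nvidia-vgpu-mgr lines above.
    delay = seconds_between("Jul 09 18:32:52", "Jul 09 19:43:19")
    print(f"nvidia-vgpu-mgr start delay: {delay} s")
```

For the two nvidia-vgpu-mgr lines quoted above, the gap works out to 4227 seconds, which is 1 h 10 min 27 s. Passing saved journal text to `longest_blocked` gives one entry per blocked task, which makes it easier to see which tasks keep waiting.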
I need help. Does anyone know why the vGPU installation gets stuck?
Thanks