[SOLVED] "unable to handle page fault" with 5.3.10 kernel but no problem with 5.0.x - how to preserve the old working kernel on updates?

Asano

When I boot my most recent kernel (5.3.10), I cannot start the VM that has an RTX 2080 SUPER passed through (I'll attach the full log with the error below). Another VM with a GT 1030 passed through still works normally.

However, when I select the previous kernel (5.0.x) from the boot menu, everything works fine, so that is what I have been doing since the last kernel update.
But now there is a new kernel update available (5.3.13-1), and I'm hesitant to install it. I have a few questions:

1) It looks like Proxmox only keeps the previous kernel available, so I fear I could end up with 5.3.10 and 5.3.13 but with 5.0.x gone. That would be bad if the bug still exists in 5.3.13. How can I ensure that Proxmox keeps the 5.0.x kernel, including its boot menu entry?

2) If I try the update and mess things up, is there an easy way to get the old kernel back? I'm on ZFS, so maybe a simple zfs snap rpool/ROOT/pve-1@beforeUpdate, followed by a rollback and restart if things go south, would work? Somehow I doubt it, though, since the bootloader would have to be updated as well, wouldn't it?

3) If I have to stick with an old kernel, where do I set its bootloader entry as the default? Currently I select it manually, but that gets a bit annoying if it is needed for longer...

I'm guessing most of this is answered in the wiki, but I couldn't find the right entries, so any links or tips are very welcome! :)

Code:
Dec 10 20:27:54 pve3 kernel: BUG: unable to handle page fault for address: ffffa38900111000
Dec 10 20:27:54 pve3 kernel: #PF: supervisor read access in kernel mode
Dec 10 20:27:54 pve3 kernel: #PF: error_code(0x0000) - not-present page
Dec 10 20:27:54 pve3 kernel: PGD ff8554067 P4D ff8554067 PUD ff8555067 PMD ff8556067 PTE 0
Dec 10 20:27:54 pve3 kernel: Oops: 0000 [#1] SMP NOPTI
Dec 10 20:27:54 pve3 kernel: CPU: 12 PID: 1313 Comm: kworker/12:2 Tainted: P           O      5.3.10-1-pve #1
Dec 10 20:27:54 pve3 kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F10 11/15/2019
Dec 10 20:27:54 pve3 kernel: Workqueue: events ccg_pm_workaround_work [ucsi_ccg]
Dec 10 20:27:54 pve3 kernel: RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
Dec 10 20:27:54 pve3 kernel: Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 cd 16 9e ca 48 8b 05 c6 1c 52 cb 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 1c 52 cb 49 39 c4 0f 88 b6 02 00 00
Dec 10 20:27:54 pve3 kernel: RSP: 0018:ffffa389013ebcc0 EFLAGS: 00010293
Dec 10 20:27:54 pve3 kernel: RAX: ffffa38900111000 RBX: 00000000ffffffff RCX: 0000000000000000
Dec 10 20:27:54 pve3 kernel: RDX: ffff95903eb1db40 RSI: 0000000000000000 RDI: ffffa389013ebbf8
Dec 10 20:27:54 pve3 kernel: RBP: ffffa389013ebce0 R08: 0000000000000000 R09: 0000000000000003
Dec 10 20:27:54 pve3 kernel: R10: 000000000000000e R11: ffff95903eb294c4 R12: 00000001000115fd
Dec 10 20:27:54 pve3 kernel: R13: ffff95902a98f018 R14: ffff95902a98f020 R15: 0000000100011503
Dec 10 20:27:54 pve3 kernel: FS:  0000000000000000(0000) GS:ffff95903eb00000(0000) knlGS:0000000000000000
Dec 10 20:27:54 pve3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 20:27:54 pve3 kernel: CR2: ffffa38900111000 CR3: 0000000f86804000 CR4: 0000000000340ee0
Dec 10 20:27:54 pve3 kernel: Call Trace:
Dec 10 20:27:54 pve3 kernel:  gpu_i2c_master_xfer+0xe8/0x22d [i2c_nvidia_gpu]
Dec 10 20:27:54 pve3 kernel:  __i2c_transfer+0x180/0x4d0
Dec 10 20:27:54 pve3 kernel:  i2c_transfer+0x88/0x100
Dec 10 20:27:54 pve3 kernel:  ccg_read+0x11e/0x170 [ucsi_ccg]
Dec 10 20:27:54 pve3 kernel:  ? __switch_to_asm+0x40/0x70
Dec 10 20:27:54 pve3 kernel:  ? __switch_to_asm+0x40/0x70
Dec 10 20:27:54 pve3 kernel:  ucsi_ccg_sync+0x56/0xb0 [ucsi_ccg]
Dec 10 20:27:54 pve3 kernel:  ucsi_notify+0x26/0x120 [typec_ucsi]
Dec 10 20:27:54 pve3 kernel:  ccg_pm_workaround_work+0x15/0x20 [ucsi_ccg]
Dec 10 20:27:54 pve3 kernel:  process_one_work+0x20f/0x3d0
Dec 10 20:27:54 pve3 kernel:  worker_thread+0x34/0x400
Dec 10 20:27:54 pve3 kernel:  kthread+0x120/0x140
Dec 10 20:27:54 pve3 kernel:  ? process_one_work+0x3d0/0x3d0
Dec 10 20:27:54 pve3 kernel:  ? __kthread_parkme+0x70/0x70
Dec 10 20:27:54 pve3 kernel:  ret_from_fork+0x22/0x40
Dec 10 20:27:54 pve3 kernel: Modules linked in: md4 cmac nls_utf8 cifs libarc4 fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter bonding edac_mce_amd kvm_amd kvm softdog crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio aesni_intel aes_x86_64 crypto_simd cryptd glue_helper uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common wmi_bmof mxm_wmi pcspkr snd_hda_intel snd_usb_audio snd_hda_codec videodev snd_usbmidi_lib k10temp snd_hda_core snd_rawmidi snd_seq_device snd_hwdep mc snd_pcm ucsi_ccg vhost_net typec_ucsi nfnetlink_log snd_timer vhost tap typec ccp snd nfnetlink ib_iser soundcore rdma_cm iw_cm ib_cm ib_core iscsi_tcp joydev libiscsi_tcp input_leds libiscsi scsi_transport_iscsi mac_hid sunrpc vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor
Dec 10 20:27:54 pve3 kernel:  zstd_compress raid6_pq libcrc32c hid_logitech_hidpp hid_logitech_dj usbmouse hid_generic usbkbd usbhid hid i2c_piix4 i2c_nvidia_gpu ahci libahci igb i2c_algo_bit ixgbe xfrm_algo dca mdio wmi
Dec 10 20:27:54 pve3 kernel: CR2: ffffa38900111000
Dec 10 20:27:54 pve3 kernel: ---[ end trace 00cfd3a4f5498663 ]---
Dec 10 20:27:54 pve3 kernel: RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
Dec 10 20:27:54 pve3 kernel: Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 cd 16 9e ca 48 8b 05 c6 1c 52 cb 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 1c 52 cb 49 39 c4 0f 88 b6 02 00 00
Dec 10 20:27:54 pve3 kernel: RSP: 0018:ffffa389013ebcc0 EFLAGS: 00010293
Dec 10 20:27:54 pve3 kernel: RAX: ffffa38900111000 RBX: 00000000ffffffff RCX: 0000000000000000
Dec 10 20:27:54 pve3 kernel: RDX: ffff95903eb1db40 RSI: 0000000000000000 RDI: ffffa389013ebbf8
Dec 10 20:27:54 pve3 kernel: RBP: ffffa389013ebce0 R08: 0000000000000000 R09: 0000000000000003
Dec 10 20:27:54 pve3 kernel: R10: 000000000000000e R11: ffff95903eb294c4 R12: 00000001000115fd
Dec 10 20:27:54 pve3 kernel: R13: ffff95902a98f018 R14: ffff95902a98f020 R15: 0000000100011503
Dec 10 20:27:54 pve3 kernel: FS:  0000000000000000(0000) GS:ffff95903eb00000(0000) knlGS:0000000000000000
Dec 10 20:27:54 pve3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 20:27:54 pve3 kernel: CR2: ffffa38900111000 CR3: 0000000f86804000 CR4: 0000000000340ee0
Dec 10 20:27:54 pve3 systemd[1]: Started 102.scope.
Dec 10 20:27:55 pve3 systemd-udevd[12067]: Using default interface naming scheme 'v240'.
Dec 10 20:27:55 pve3 systemd-udevd[12067]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 10 20:27:55 pve3 systemd-udevd[12067]: Could not generate persistent MAC address for tap102i0: No such file or directory
Dec 10 20:27:55 pve3 kernel: device tap102i0 entered promiscuous mode
Dec 10 20:27:55 pve3 kernel: vmbr0: port 3(tap102i0) entered blocking state
Dec 10 20:27:55 pve3 kernel: vmbr0: port 3(tap102i0) entered disabled state
Dec 10 20:27:55 pve3 kernel: vmbr0: port 3(tap102i0) entered blocking state
Dec 10 20:27:55 pve3 kernel: vmbr0: port 3(tap102i0) entered forwarding state
Dec 10 20:27:57 pve3 kernel: vfio-pci 0000:0b:00.0: enabling device (0000 -> 0003)
Dec 10 20:27:57 pve3 kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Dec 10 20:27:57 pve3 kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
 
Hi
1) It looks like Proxmox only keeps the previous kernel available, so I fear I could end up with 5.3.10 and 5.3.13 but with 5.0.x gone. That would be bad if the bug still exists in 5.3.13. How can I ensure that Proxmox keeps the 5.0.x kernel, including its boot menu entry?
You can always install any kernel you like.
Code:
apt install pve-kernel-5.0.21-5-pve
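A minimal sketch of why an explicit install protects the old kernel, assuming old kernels are normally cleaned up by `apt autoremove` as on Debian: `apt install` marks the package as manually installed, and autoremove never removes manually installed packages. The helper `kernel_is_manual` is hypothetical (not a Proxmox or apt tool); it only checks a package list such as the output of `apt-mark showmanual`.

```shell
#!/bin/sh
# Hypothetical helper, not part of apt or Proxmox VE.
# kernel_is_manual VERSION: read package names (one per line) from stdin and
# succeed if pve-kernel-VERSION is among them. Fed with `apt-mark showmanual`,
# it shows whether that kernel package is marked as manually installed.
kernel_is_manual() {
    grep -qxF "pve-kernel-$1"
}

# Example (as root):
#   apt install pve-kernel-5.0.21-5-pve       # explicit install marks it manual
#   apt-mark manual pve-kernel-5.0.21-5-pve   # same for an already installed kernel
#   apt-mark showmanual | kernel_is_manual 5.0.21-5-pve && echo "5.0.21-5-pve is kept"
```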

2) If I try the update and mess things up, is there an easy way to get the old kernel back? I'm on ZFS, so maybe a simple zfs snap rpool/ROOT/pve-1@beforeUpdate, followed by a rollback and restart if things go south, would work? Somehow I doubt it, though, since the bootloader would have to be updated as well, wouldn't it?
No, just install the kernel; at boot time you can choose the correct one.

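3) If I have to stick with an old kernel, where do I set its bootloader entry as the default? Currently I select it manually, but that gets a bit annoying if it is needed for longer...
You can edit the /etc/default/grub file and replace

GRUB_DEFAULT=0

with

GRUB_DEFAULT="1><NO of kernel image>"

where 1 selects the "Advanced options" submenu and <NO of kernel image> is the position of the kernel within that submenu, counting from 0. Alternatively, you can use the image's menu entry ID for GRUB_DEFAULT. Either way, run update-grub afterwards so the change takes effect.

For more information, see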
Code:
 info -f grub -n 'Simple configuration'
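The GRUB steps above can be sketched in shell. Both functions are hypothetical helpers, not part of GRUB or Proxmox VE: `list_grub_entries` prints the index string that `GRUB_DEFAULT` expects for each entry, assuming the layout update-grub generates (top-level entries start in column 0, submenu children are indented), and `set_grub_default` rewrites the `GRUB_DEFAULT` line. Numeric indexes shift when kernels are installed or removed; a title path such as `"Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.0.21-5-pve"` is likely more stable, assuming your entry titles match.

```shell
#!/bin/sh
# Hypothetical helpers, not part of GRUB or Proxmox VE.

# list_grub_entries [grub.cfg]: print "<index><TAB><title>" for every menu
# entry, where <index> is "N" for a top-level entry and "N>M" for entry M
# inside submenu N (both counted from 0), i.e. the form GRUB_DEFAULT accepts.
# Assumes top-level "menuentry"/"submenu" lines start in column 0 and
# submenu children are indented, as written by update-grub.
list_grub_entries() {
    cfg="${1:-/boot/grub/grub.cfg}"
    awk -v q="'" '
        BEGIN { top = -1; in_sub = 0 }
        # Top-level submenu: takes the next index, its children restart at 0.
        /^submenu / {
            top++; in_sub = 1; child = -1
            split($0, parts, q); printf "%d\t%s\n", top, parts[2]
            next
        }
        # Top-level entry: takes the next index and ends any open submenu.
        /^menuentry / {
            top++; in_sub = 0
            split($0, parts, q); printf "%d\t%s\n", top, parts[2]
            next
        }
        # Indented entry inside the current submenu.
        in_sub && /^[ \t]+menuentry / {
            child++
            split($0, parts, q); printf "%d>%d\t%s\n", top, child, parts[2]
        }
    ' "$cfg"
}

# set_grub_default FILE VALUE: replace the GRUB_DEFAULT line in FILE, or append
# one if missing. VALUE must not contain "|", "&" or "\", which sed interprets.
set_grub_default() {
    file="$1"; value="$2"
    if grep -q '^GRUB_DEFAULT=' "$file"; then
        sed -i "s|^GRUB_DEFAULT=.*|GRUB_DEFAULT=\"$value\"|" "$file"
    else
        printf 'GRUB_DEFAULT="%s"\n' "$value" >> "$file"
    fi
}

# Example (as root):
#   list_grub_entries                          # find the index of the wanted kernel
#   set_grub_default /etc/default/grub '1>2'   # '1>2' is only an example index
#   update-grub                                # regenerate grub.cfg
```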
 

Hi! My setup is similar to yours, and I get the same errors with the 5.3.10 kernel.
I'm using a Gigabyte X570 AORUS ELITE and cannot start my VM that has an RTX 2070 SUPER passed through.
I'll attach the full log with the error below too.

You can change your default kernel entry by following this article.
https://unix.stackexchange.com/questions/198003/set-default-kernel-in-grub


[ 572.327056] xhci_hcd 0000:09:00.2: remove, state 4
[ 572.327059] usb usb6: USB disconnect, device number 1
[ 572.327461] xhci_hcd 0000:09:00.2: USB bus 6 deregistered
[ 572.327465] xhci_hcd 0000:09:00.2: remove, state 4
[ 572.327466] usb usb5: USB disconnect, device number 1
[ 572.328089] xhci_hcd 0000:09:00.2: USB bus 5 deregistered
[ 572.569714] BUG: unable to handle page fault for address: ffffba9200131000
[ 572.571928] #PF: supervisor read access in kernel mode
[ 572.572897] #PF: error_code(0x0000) - not-present page
[ 572.573860] PGD 81b154067 P4D 81b154067 PUD 81b155067 PMD 81b156067 PTE 0
[ 572.574833] Oops: 0000 [#1] SMP NOPTI
[ 572.575798] CPU: 3 PID: 112 Comm: kworker/3:2 Tainted: P O 5.3.10-1-pve #1
[ 572.576763] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F10 11/15/2019
[ 572.577708] Workqueue: events ccg_pm_workaround_work [ucsi_ccg]
[ 572.578660] RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
[ 572.579606] Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 cd 16 9a e7 48 8b 05 c6 1c 4e e8 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 1c 4e e8 49 39 c4 0f 88 b6 02 00 00
[ 572.580591] RSP: 0018:ffffba920058bcc0 EFLAGS: 00010283
[ 572.581571] RAX: ffffba9200131000 RBX: 00000000ffffffff RCX: 0000000000000000
[ 572.582553] RDX: ffff9cee5e6ddb40 RSI: 0000000000000000 RDI: ffffba920058bbf8
[ 572.583527] RBP: ffffba920058bce0 R08: 0000000000000000 R09: 0000000000000003
[ 572.584494] R10: 000000000000000e R11: ffff9cee5e6e94c4 R12: 0000000100010af6
[ 572.585454] R13: ffff9cee5abe5818 R14: ffff9cee5abe5820 R15: 00000001000109fc
[ 572.586408] FS: 0000000000000000(0000) GS:ffff9cee5e6c0000(0000) knlGS:0000000000000000
[ 572.587358] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 572.588300] CR2: ffffba9200131000 CR3: 00000008034f6000 CR4: 0000000000340ee0
[ 572.589243] Call Trace:
[ 572.590204] gpu_i2c_master_xfer+0xe8/0x22d [i2c_nvidia_gpu]
[ 572.591152] __i2c_transfer+0x180/0x4d0
[ 572.592087] i2c_transfer+0x88/0x100
[ 572.593011] ccg_read+0x11e/0x170 [ucsi_ccg]
[ 572.593940] ? __switch_to_asm+0x40/0x70
[ 572.594862] ? __switch_to_asm+0x40/0x70
[ 572.595772] ucsi_ccg_sync+0x56/0xb0 [ucsi_ccg]
[ 572.596681] ucsi_notify+0x26/0x120 [typec_ucsi]
[ 572.597591] ccg_pm_workaround_work+0x15/0x20 [ucsi_ccg]
[ 572.598507] process_one_work+0x20f/0x3d0
[ 572.599412] worker_thread+0x34/0x400
[ 572.600313] kthread+0x120/0x140
[ 572.601213] ? process_one_work+0x3d0/0x3d0
[ 572.602111] ? __kthread_parkme+0x70/0x70
[ 572.603004] ret_from_fork+0x22/0x40
[ 572.603894] Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter bonding softdog nfnetlink_log nfnetlink edac_mce_amd kvm_amd kvm zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_generic amdgpu ledtrig_audio aesni_intel snd_hda_codec_hdmi amd_iommu_v2 aes_x86_64 gpu_sched crypto_simd ttm cryptd glue_helper snd_hda_intel snd_usb_audio drm_kms_helper wmi_bmof snd_usbmidi_lib k10temp ucsi_ccg snd_hda_codec snd_rawmidi xpad snd_hda_core snd_seq_device drm typec_ucsi ff_memless mc snd_hwdep typec fb_sys_fops joydev input_leds syscopyarea snd_pcm ccp zcommon(PO) sysfillrect snd_timer sysimgblt snd znvpair(PO) soundcore spl(O) vhost_net vhost tap ib_iser mac_hid rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio sunrpc ip_tables x_tables autofs4 btrfs xor
[ 572.603918] zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c usbmouse hid_generic usbkbd usbhid hid uas usb_storage i2c_piix4 i2c_nvidia_gpu ahci libahci igb i2c_algo_bit dca wmi
[ 572.610024] CR2: ffffba9200131000
[ 572.611095] ---[ end trace 2e2ef1650bea31c1 ]---
[ 572.612159] RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
[ 572.613220] Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 cd 16 9a e7 48 8b 05 c6 1c 4e e8 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 1c 4e e8 49 39 c4 0f 88 b6 02 00 00
[ 572.614320] RSP: 0018:ffffba920058bcc0 EFLAGS: 00010283
[ 572.615416] RAX: ffffba9200131000 RBX: 00000000ffffffff RCX: 0000000000000000
[ 572.616520] RDX: ffff9cee5e6ddb40 RSI: 0000000000000000 RDI: ffffba920058bbf8
[ 572.617617] RBP: ffffba920058bce0 R08: 0000000000000000 R09: 0000000000000003
[ 572.618704] R10: 000000000000000e R11: ffff9cee5e6e94c4 R12: 0000000100010af6
[ 572.619790] R13: ffff9cee5abe5818 R14: ffff9cee5abe5820 R15: 00000001000109fc
[ 572.620865] FS: 0000000000000000(0000) GS:ffff9cee5e6c0000(0000) knlGS:0000000000000000
[ 572.621943] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 572.623013] CR2: ffffba9200131000 CR3: 00000008034f6000 CR4: 0000000000340ee0
[ 572.829227] device tap100i0 entered promiscuous mode
[ 572.843851] fwbr100i0: port 1(fwln100i0) entered blocking state
[ 572.844973] fwbr100i0: port 1(fwln100i0) entered disabled state
[ 572.846125] device fwln100i0 entered promiscuous mode
[ 572.847253] fwbr100i0: port 1(fwln100i0) entered blocking state
[ 572.848350] fwbr100i0: port 1(fwln100i0) entered forwarding state
[ 572.851107] vmbr0: port 2(fwpr100p0) entered blocking state
[ 572.852165] vmbr0: port 2(fwpr100p0) entered disabled state
[ 572.853265] device fwpr100p0 entered promiscuous mode
[ 572.854339] vmbr0: port 2(fwpr100p0) entered blocking state
[ 572.855369] vmbr0: port 2(fwpr100p0) entered forwarding state
[ 572.858186] fwbr100i0: port 2(tap100i0) entered blocking state
[ 572.859207] fwbr100i0: port 2(tap100i0) entered disabled state
[ 572.860250] fwbr100i0: port 2(tap100i0) entered blocking state
[ 572.861235] fwbr100i0: port 2(tap100i0) entered forwarding state
[ 573.214290] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 573.215519] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
 
@spicyisland nice to know that there are more people with a similar setup :)

@wolfgang thanks, that gave me the confidence to try the new 5.3.13-1-pve kernel. However, since my system boots via EFI, things seem a bit different on it.

But first of all, 5.3.13-1-pve still has a similar problem (log below). This time, trying to boot the VM also crashed my host. So 5.0.21-5-pve is still the latest working version. Is there any more information about the actual problem? Maybe there is a workaround so that RTX 20xx passthrough also works with the 5.3.x kernels?

Regarding the boot options, I was able to get the old kernel back with
Code:
pve-efiboot-tool kernel add 5.0.21-5-pve
pve-efiboot-tool refresh
Now the kernel is in the "Manually selected kernels" list. However, 5.3.13-1-pve is still the default, and pve-efiboot-tool kernel remove 5.3.13-1-pve doesn't work because it is in the "Automatically selected kernels" list. I couldn't find a way to set the old kernel as the default with pve-efiboot-tool; is there one?

Dec 12 14:57:28 pve3 kernel: [ 478.511644] BUG: unable to handle page fault for address: ffffa74840111000
Dec 12 14:57:28 pve3 kernel: [ 478.511649] #PF: supervisor read access in kernel mode
Dec 12 14:57:28 pve3 kernel: [ 478.511650] #PF: error_code(0x0000) - not-present page
Dec 12 14:57:28 pve3 kernel: [ 478.511651] PGD ff8154067 P4D ff8154067 PUD ff8155067 PMD ff8156067 PTE 0
Dec 12 14:57:28 pve3 kernel: [ 478.511654] Oops: 0000 [#1] SMP NOPTI
Dec 12 14:57:28 pve3 kernel: [ 478.511656] CPU: 9 PID: 9910 Comm: kworker/9:0 Tainted: P O 5.3.13-1-pve #1
Dec 12 14:57:28 pve3 kernel: [ 478.511658] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F10 11/15/2019
Dec 12 14:57:28 pve3 kernel: [ 478.511662] Workqueue: events ccg_pm_workaround_work [ucsi_ccg]
Dec 12 14:57:28 pve3 kernel: [ 478.511665] RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
Dec 12 14:57:28 pve3 kernel: [ 478.511667] Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 ed 9b 97 d3 48 8b 05 c6 9c 4b d4 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 9c 4b d4 49 39 c4 0f 88 b6 02 00 00
Dec 12 14:57:28 pve3 kernel: [ 478.511669] RSP: 0018:ffffa7486c827cc0 EFLAGS: 00010293
Dec 12 14:57:28 pve3 kernel: [ 478.511671] RAX: ffffa74840111000 RBX: 00000000ffffffff RCX: 0000000000000000
Dec 12 14:57:28 pve3 kernel: [ 478.511672] RDX: ffff9307be85db40 RSI: 0000000000000000 RDI: ffffa7486c827bf8
Dec 12 14:57:28 pve3 kernel: [ 478.511673] RBP: ffffa7486c827ce0 R08: 0000000000000000 R09: 0000000000000001
Dec 12 14:57:28 pve3 kernel: [ 478.511675] R10: 0000000000de98ab R11: 0000000000000000 R12: 000000010000af09
Dec 12 14:57:28 pve3 kernel: [ 478.511676] R13: ffff9307b7486018 R14: ffff9307b7486020 R15: 000000010000ae0f
Dec 12 14:57:28 pve3 kernel: [ 478.511677] FS: 0000000000000000(0000) GS:ffff9307be840000(0000) knlGS:0000000000000000
Dec 12 14:57:28 pve3 kernel: [ 478.511679] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 12 14:57:28 pve3 kernel: [ 478.511680] CR2: ffffa74840111000 CR3: 0000000f84236000 CR4: 0000000000340ee0
Dec 12 14:57:28 pve3 kernel: [ 478.511681] Call Trace:
Dec 12 14:57:28 pve3 kernel: [ 478.511684] gpu_i2c_master_xfer+0xe8/0x22d [i2c_nvidia_gpu]
Dec 12 14:57:28 pve3 kernel: [ 478.511688] __i2c_transfer+0x180/0x4d0
Dec 12 14:57:28 pve3 kernel: [ 478.511690] i2c_transfer+0x88/0x100
Dec 12 14:57:28 pve3 kernel: [ 478.511691] ccg_read+0x11e/0x170 [ucsi_ccg]
Dec 12 14:57:28 pve3 kernel: [ 478.511694] ? __switch_to_asm+0x40/0x70
Dec 12 14:57:28 pve3 kernel: [ 478.511696] ? __switch_to_asm+0x40/0x70
Dec 12 14:57:28 pve3 kernel: [ 478.511697] ucsi_ccg_sync+0x56/0xb0 [ucsi_ccg]
Dec 12 14:57:28 pve3 kernel: [ 478.511700] ucsi_notify+0x26/0x120 [typec_ucsi]
Dec 12 14:57:28 pve3 kernel: [ 478.511701] ccg_pm_workaround_work+0x15/0x20 [ucsi_ccg]
Dec 12 14:57:28 pve3 kernel: [ 478.511704] process_one_work+0x20f/0x3d0
Dec 12 14:57:28 pve3 kernel: [ 478.511706] worker_thread+0x34/0x400
Dec 12 14:57:28 pve3 kernel: [ 478.511707] kthread+0x120/0x140
Dec 12 14:57:28 pve3 kernel: [ 478.511709] ? process_one_work+0x3d0/0x3d0
Dec 12 14:57:28 pve3 kernel: [ 478.511710] ? __kthread_parkme+0x70/0x70
Dec 12 14:57:28 pve3 kernel: [ 478.511712] ret_from_fork+0x22/0x40
Dec 12 14:57:28 pve3 kernel: [ 478.511713] Modules linked in: veth md4 cmac nls_utf8 cifs libarc4 fscache ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw xt_mac ipt_REJECT nf_reject_ipv4 xt_physdev xt_addrtype xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp xt_comment xt_set xt_mark ip_set_hash_net ip_set iptable_filter bpfilter bonding edac_mce_amd kvm_amd softdog kvm nfnetlink_log nfnetlink crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vhost_net vhost tap ib_iser rdma_cm iw_cm snd_hda_codec_realtek ib_cm snd_hda_codec_generic aesni_intel ledtrig_audio ib_core uvcvideo videobuf2_vmalloc aes_x86_64 crypto_simd videobuf2_memops iscsi_tcp cryptd libiscsi_tcp videobuf2_v4l2 glue_helper libiscsi videobuf2_common snd_hda_intel snd_usb_audio pcspkr snd_usbmidi_lib mxm_wmi videodev wmi_bmof snd_hda_codec snd_rawmidi snd_seq_device scsi_transport_iscsi snd_hda_core mc k10temp snd_hwdep snd_pcm snd_timer snd soundcore ucsi_ccg typec_ucsi
Dec 12 14:57:28 pve3 kernel: [ 478.511738] ccp typec joydev input_leds mac_hid vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_logitech_hidpp hid_logitech_dj usbmouse hid_generic usbkbd usbhid hid i2c_piix4 i2c_nvidia_gpu ahci libahci igb ixgbe i2c_algo_bit xfrm_algo dca mdio wmi
Dec 12 14:57:28 pve3 kernel: [ 478.511770] CR2: ffffa74840111000
Dec 12 14:57:28 pve3 kernel: [ 478.511773] ---[ end trace 18173bff6b8a442e ]---
Dec 12 14:57:28 pve3 kernel: [ 478.511775] RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
Dec 12 14:57:28 pve3 kernel: [ 478.511777] Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 ed 9b 97 d3 48 8b 05 c6 9c 4b d4 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 9c 4b d4 49 39 c4 0f 88 b6 02 00 00
Dec 12 14:57:28 pve3 kernel: [ 478.511780] RSP: 0018:ffffa7486c827cc0 EFLAGS: 00010293
Dec 12 14:57:28 pve3 kernel: [ 478.511781] RAX: ffffa74840111000 RBX: 00000000ffffffff RCX: 0000000000000000
Dec 12 14:57:28 pve3 kernel: [ 478.511782] RDX: ffff9307be85db40 RSI: 0000000000000000 RDI: ffffa7486c827bf8
Dec 12 14:57:28 pve3 kernel: [ 478.511783] RBP: ffffa7486c827ce0 R08: 0000000000000000 R09: 0000000000000001
Dec 12 14:57:28 pve3 kernel: [ 478.511785] R10: 0000000000de98ab R11: 0000000000000000 R12: 000000010000af09
Dec 12 14:57:28 pve3 kernel: [ 478.511786] R13: ffff9307b7486018 R14: ffff9307b7486020 R15: 000000010000ae0f
Dec 12 14:57:28 pve3 kernel: [ 478.511787] FS: 0000000000000000(0000) GS:ffff9307be840000(0000) knlGS:0000000000000000
Dec 12 14:57:28 pve3 kernel: [ 478.511789] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 12 14:57:28 pve3 kernel: [ 478.511790] CR2: ffffa74840111000 CR3: 0000000f84236000 CR4: 0000000000340ee0
 
You must set this manually.
Mount the EFI partition, then set the default in <mp>/loader/loader.conf (<mp> being the mount point).
This setting is not persistent and will be overwritten when you update the kernel.
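Editing loader.conf can be sketched as follows. `set_loader_default` is a hypothetical helper; it assumes the systemd-boot `default` key, which takes a glob matched against the entry names, and entry files named like `proxmox-<kernel version>.conf`, so check `loader/entries/` on the ESP first. As noted above, the change has to be reapplied after each kernel update, and on a mirrored ZFS setup presumably on every ESP.

```shell
#!/bin/sh
# Hypothetical helper, not part of systemd-boot or Proxmox VE.
# set_loader_default CONF ENTRY: set the "default" key in a systemd-boot
# loader.conf, replacing an existing "default" line or appending one.
# ENTRY is a glob matched against the boot entry names in loader/entries/.
set_loader_default() {
    conf="$1"; entry="$2"
    if grep -q '^default[[:space:]]' "$conf"; then
        sed -i "s|^default[[:space:]].*|default $entry|" "$conf"
    else
        printf 'default %s\n' "$entry" >> "$conf"
    fi
}

# Example (as root). The UUID is a placeholder; on PVE 6.x the ESPs managed
# by pve-efiboot-tool are assumed to be listed in /etc/kernel/pve-efiboot-uuids.
#   mount /dev/disk/by-uuid/XXXX-XXXX /mnt
#   ls /mnt/loader/entries/                    # check the actual entry names
#   set_loader_default /mnt/loader/loader.conf 'proxmox-5.0.21-5-pve*'
#   umount /mnt
```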
 
The current kernel update (5.3.13-2-pve) has similar issues for the VM with the RTX 2080 SUPER passthrough.
This time a USB controller is also involved, but that is probably because in the meantime I added USB controller passthrough and stopped using the built-in USB feature with the virtualized USB bus (which gave me issues with some passed-through USB devices). However, the VM with the GT 1030 passthrough (the system's primary/BIOS GPU) still worked, though USB felt a little off.

5.0.21-5-pve works with all updates and this setup. Over the last few weeks it has also been very stable (zero crashes or unexpected device behavior), and the VMs run at near-native performance: benchmarks are comparable to a bare-metal install on the same hardware, and any added input/output latency for USB and video is below 5 ms, since I cannot measure it with my equipment.

That's neat, but now I'm wondering what to do about this kernel issue. Can I keep updating other packages while sticking to the old kernel, or should I stop updating altogether at some point? Or should I try to work around the issue?

Jan 28 15:45:08 pxe1 kernel: [ 584.011005] BUG: unable to handle page fault for address: ffffb1e940139000
Jan 28 15:45:08 pxe1 kernel: [ 584.011011] #PF: supervisor read access in kernel mode
Jan 28 15:45:08 pxe1 kernel: [ 584.011013] #PF: error_code(0x0000) - not-present page
Jan 28 15:45:08 pxe1 kernel: [ 584.011014] PGD ff8554067 P4D ff8554067 PUD ff8555067 PMD ff8556067 PTE 0
Jan 28 15:45:08 pxe1 kernel: [ 584.011018] Oops: 0000 [#1] SMP NOPTI
Jan 28 15:45:08 pxe1 kernel: [ 584.011020] CPU: 6 PID: 290 Comm: kworker/6:3 Tainted: P O 5.3.13-2-pve #1
Jan 28 15:45:08 pxe1 kernel: [ 584.011022] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F10 11/15/2019
Jan 28 15:45:08 pxe1 kernel: [ 584.011028] Workqueue: events ccg_pm_workaround_work [ucsi_ccg]
Jan 28 15:45:08 pxe1 kernel: [ 584.011031] RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
Jan 28 15:45:08 pxe1 kernel: [ 584.011033] Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 ed 1b 8a d9 48 8b 05 c6 1c 3e da 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 1c 3e da 49 39 c4 0f 88 b6 02 00 00
Jan 28 15:45:08 pxe1 kernel: [ 584.011036] RSP: 0018:ffffb1e940867cc0 EFLAGS: 00010287
Jan 28 15:45:08 pxe1 kernel: [ 584.011038] RAX: ffffb1e940139000 RBX: 00000000ffffffff RCX: 0000000000000000
Jan 28 15:45:08 pxe1 kernel: [ 584.011040] RDX: ffff94937e99db40 RSI: 0000000000000000 RDI: ffffb1e940867bf8
Jan 28 15:45:08 pxe1 kernel: [ 584.011041] RBP: ffffb1e940867ce0 R08: 0000000000000000 R09: 0000000000000003
Jan 28 15:45:08 pxe1 kernel: [ 584.011043] R10: 000000000000000e R11: ffff94937e9a94c4 R12: 0000000100011652
Jan 28 15:45:08 pxe1 kernel: [ 584.011044] R13: ffff94937799a818 R14: ffff94937799a820 R15: 0000000100011558
Jan 28 15:45:08 pxe1 kernel: [ 584.011046] FS: 0000000000000000(0000) GS:ffff94937e980000(0000) knlGS:0000000000000000
Jan 28 15:45:08 pxe1 kernel: [ 584.011048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 15:45:08 pxe1 kernel: [ 584.011049] CR2: ffffb1e940139000 CR3: 0000000fcf584000 CR4: 0000000000340ee0
Jan 28 15:45:08 pxe1 kernel: [ 584.011051] Call Trace:
Jan 28 15:45:08 pxe1 kernel: [ 584.011055] gpu_i2c_master_xfer+0xe8/0x22d [i2c_nvidia_gpu]
Jan 28 15:45:08 pxe1 kernel: [ 584.011060] __i2c_transfer+0x180/0x4d0
Jan 28 15:45:08 pxe1 kernel: [ 584.011062] i2c_transfer+0x88/0x100
Jan 28 15:45:08 pxe1 kernel: [ 584.011064] ccg_read+0x11e/0x170 [ucsi_ccg]
Jan 28 15:45:08 pxe1 kernel: [ 584.011068] ? __switch_to_asm+0x40/0x70
Jan 28 15:45:08 pxe1 kernel: [ 584.011070] ? __switch_to_asm+0x40/0x70
Jan 28 15:45:08 pxe1 kernel: [ 584.011071] ucsi_ccg_sync+0x56/0xb0 [ucsi_ccg]
Jan 28 15:45:08 pxe1 kernel: [ 584.011075] ucsi_notify+0x26/0x120 [typec_ucsi]
Jan 28 15:45:08 pxe1 kernel: [ 584.011077] ccg_pm_workaround_work+0x15/0x20 [ucsi_ccg]
Jan 28 15:45:08 pxe1 kernel: [ 584.011080] process_one_work+0x20f/0x3d0
Jan 28 15:45:08 pxe1 kernel: [ 584.011082] worker_thread+0x34/0x400
Jan 28 15:45:08 pxe1 kernel: [ 584.011083] kthread+0x120/0x140
Jan 28 15:45:08 pxe1 kernel: [ 584.011084] ? process_one_work+0x3d0/0x3d0
Jan 28 15:45:08 pxe1 kernel: [ 584.011086] ? __kthread_parkme+0x70/0x70
Jan 28 15:45:08 pxe1 kernel: [ 584.011087] ret_from_fork+0x22/0x40
Jan 28 15:45:08 pxe1 kernel: [ 584.011089] Modules linked in: tcp_diag inet_diag veth md4 cmac nls_utf8 cifs libarc4 fscache ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw xt_mac ipt_REJECT nf_reject_ipv4 xt_physdev xt_addrtype xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp xt_comment xt_set xt_mark ip_set_hash_net ip_set edac_mce_amd kvm_amd kvm softdog iptable_filter bpfilter crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio aesni_intel aes_x86_64 snd_hda_intel crypto_simd uvcvideo snd_hda_codec ucsi_ccg snd_usb_audio cryptd videobuf2_vmalloc typec_ucsi glue_helper pcspkr videobuf2_memops wmi_bmof mxm_wmi k10temp videobuf2_v4l2 ccp snd_hda_core typec snd_usbmidi_lib vhost_net snd_hwdep vhost snd_rawmidi videobuf2_common tap snd_seq_device input_leds snd_pcm ib_iser videodev rdma_cm snd_timer iw_cm mc snd ib_cm joydev nfnetlink_log soundcore nfnetlink ib_core iscsi_tcp
Jan 28 15:45:08 pxe1 kernel: [ 584.011114] libiscsi_tcp libiscsi it87 hwmon_vid mac_hid bonding scsi_transport_iscsi vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_logitech_hidpp hid_logitech_dj usbmouse hid_generic usbkbd usbhid hid i2c_piix4 i2c_nvidia_gpu ahci libahci ixgbe igb i2c_algo_bit xfrm_algo dca mdio wmi
Jan 28 15:45:08 pxe1 kernel: [ 584.011141] CR2: ffffb1e940139000
Jan 28 15:45:08 pxe1 kernel: [ 584.011143] ---[ end trace d3c5b3c09bae9fc2 ]---
Jan 28 15:45:08 pxe1 kernel: [ 584.011145] RIP: 0010:gpu_i2c_check_status.isra.5+0x52/0xa0 [i2c_nvidia_gpu]
Jan 28 15:45:08 pxe1 kernel: [ 584.011146] Code: 25 00 00 00 60 3d 00 00 00 60 75 24 be 58 02 00 00 bf f4 01 00 00 e8 ed 1b 8a d9 48 8b 05 c6 1c 3e da 4c 39 e0 79 09 49 8b 06 <8b> 18 85 db 78 ce 48 8b 05 b1 1c 3e da 49 39 c4 0f 88 b6 02 00 00
Jan 28 15:45:08 pxe1 kernel: [ 584.011148] RSP: 0018:ffffb1e940867cc0 EFLAGS: 00010287
Jan 28 15:45:08 pxe1 kernel: [ 584.011150] RAX: ffffb1e940139000 RBX: 00000000ffffffff RCX: 0000000000000000
Jan 28 15:45:08 pxe1 kernel: [ 584.011151] RDX: ffff94937e99db40 RSI: 0000000000000000 RDI: ffffb1e940867bf8
Jan 28 15:45:08 pxe1 kernel: [ 584.011152] RBP: ffffb1e940867ce0 R08: 0000000000000000 R09: 0000000000000003
Jan 28 15:45:08 pxe1 kernel: [ 584.011153] R10: 000000000000000e R11: ffff94937e9a94c4 R12: 0000000100011652
Jan 28 15:45:08 pxe1 kernel: [ 584.011154] R13: ffff94937799a818 R14: ffff94937799a820 R15: 0000000100011558
Jan 28 15:45:08 pxe1 kernel: [ 584.011155] FS: 0000000000000000(0000) GS:ffff94937e980000(0000) knlGS:0000000000000000
Jan 28 15:45:08 pxe1 kernel: [ 584.011157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 15:45:08 pxe1 kernel: [ 584.011158] CR2: ffffb1e940139000 CR3: 0000000fcf584000 CR4: 0000000000340ee0
Jan 28 15:45:08 pxe1 kernel: [ 584.019293] xhci_hcd 0000:08:00.3: remove, state 4
Jan 28 15:45:08 pxe1 kernel: [ 584.019303] usb usb4: USB disconnect, device number 1
Jan 28 15:45:08 pxe1 kernel: [ 584.019305] usb 4-3: USB disconnect, device number 2
Jan 28 15:45:08 pxe1 kernel: [ 584.019618] xhci_hcd 0000:08:00.3: USB bus 4 deregistered
Jan 28 15:45:08 pxe1 kernel: [ 584.019626] xhci_hcd 0000:08:00.3: remove, state 1
Jan 28 15:45:08 pxe1 kernel: [ 584.019630] usb usb3: USB disconnect, device number 1
Jan 28 15:45:08 pxe1 kernel: [ 584.019633] usb 3-3: USB disconnect, device number 2
Jan 28 15:45:08 pxe1 kernel: [ 584.019635] usb 3-3.1: USB disconnect, device number 4
Jan 28 15:45:08 pxe1 kernel: [ 584.086772] usb 3-3.2: USB disconnect, device number 5
Jan 28 15:45:08 pxe1 kernel: [ 584.110942] usb 3-3.3: USB disconnect, device number 6
Jan 28 15:45:08 pxe1 kernel: [ 584.206800] usb 3-6: USB disconnect, device number 3
Jan 28 15:45:08 pxe1 kernel: [ 584.223272] xhci_hcd 0000:08:00.3: USB bus 3 deregistered
Jan 28 15:45:08 pxe1 kernel: [ 584.246398] vfio-pci 0000:08:00.3: Refused to change power state, currently in D0
Jan 28 15:45:09 pxe1 systemd[1]: Started 102.scope.
Jan 28 15:45:09 pxe1 systemd-udevd[14303]: Using default interface naming scheme 'v240'.
Jan 28 15:45:09 pxe1 systemd-udevd[14303]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 28 15:45:09 pxe1 systemd-udevd[14303]: Could not generate persistent MAC address for tap102i0: No such file or directory
Jan 28 15:45:09 pxe1 kernel: [ 584.973474] device tap102i0 entered promiscuous mode
Jan 28 15:45:09 pxe1 kernel: [ 584.978244] vmbr0: port 3(tap102i0) entered blocking state
Jan 28 15:45:09 pxe1 kernel: [ 584.978248] vmbr0: port 3(tap102i0) entered disabled state
Jan 28 15:45:09 pxe1 kernel: [ 584.978319] vmbr0: port 3(tap102i0) entered blocking state
Jan 28 15:45:09 pxe1 kernel: [ 584.978321] vmbr0: port 3(tap102i0) entered forwarding state
Jan 28 15:45:11 pxe1 kernel: [ 587.008266] vfio-pci 0000:0b:00.0: enabling device (0000 -> 0003)
Jan 28 15:45:11 pxe1 kernel: [ 587.114556] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Jan 28 15:45:11 pxe1 kernel: [ 587.114587] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Jan 28 15:45:11 pxe1 kernel: [ 587.194704] vfio-pci 0000:08:00.3: enabling device (0000 -> 0002)
 
You have to try it.

PCI passthrough involves many components that are not controlled by Proxmox VE.
The firmware of the motherboard and of the device has a big impact.
 
I had the same problem, I think.
You can try what I did and see if that helps.

That is a nice find and analysis!! Exactly the same for me (including the failure on the first VM boot, which I had also ignored since simply booting a second time works).

And now, with blacklist i2c-nvidia-gpu added to /etc/modprobe.d/blacklist.conf, kernel 5.3.13-3-pve and all VMs, including the one with the RTX 2080 SUPER passed through, work :)

Thanks so much for sharing!
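The blacklist fix can be applied idempotently with a small sketch like the one below. `add_blacklist` is a hypothetical helper; the file and module name come from the post above. The thread does not say whether `update-initramfs -u` was also needed; it may be if the module is loaded from the initramfs.

```shell
#!/bin/sh
# Hypothetical helper, not part of Proxmox VE or kmod.
# add_blacklist FILE MODULE: append "blacklist MODULE" to a modprobe config
# file unless the identical line is already there, so repeated runs are safe.
add_blacklist() {
    file="$1"; module="$2"
    if ! grep -qxF "blacklist $module" "$file" 2>/dev/null; then
        printf 'blacklist %s\n' "$module" >> "$file"
    fi
}

# Example (as root), with the file and module name from the post above:
#   add_blacklist /etc/modprobe.d/blacklist.conf i2c-nvidia-gpu
#   reboot
#   lsmod | grep i2c_nvidia_gpu                # no output: module not loaded
```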
 
Nice to hear that it helped you too!
 
