openvswitch keeps crashing after proxmox 5 to 6 upgrade

szhel

Oct 11, 2019
Hello all,

I have run into a problem. After upgrading Proxmox 5 to 6, openvswitch keeps crashing with the following call trace:

Code:
Oct 11 20:19:26 vms kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Oct 11 20:19:26 vms kernel: #PF error: [normal kernel read fault]
Oct 11 20:19:26 vms kernel: PGD 0 P4D 0
Oct 11 20:19:26 vms kernel: Oops: 0000 [#1] SMP PTI
Oct 11 20:19:26 vms kernel: CPU: 9 PID: 1472 Comm: handler34 Tainted: P          IO      5.0.21-2-pve #1
Oct 11 20:19:26 vms kernel: Hardware name: Supermicro X8DT6/X8DT6, BIOS 2.0c    05/15/2012
Oct 11 20:19:26 vms kernel: RIP: 0010:kmem_cache_alloc_node+0x84/0x200
Oct 11 20:19:26 vms kernel: Code: 89 01 00 00 4d 8b 07 65 49 8b 50 08 65 4c 03 05 4a 07 39 5d 4d 8b 10 4d 85 d2 74 1a 41 83 fd ff 0f 84 84 00 00 00 49 8b 40 10 <48> 8b 00 48 c1 e8 36 41 39 c5 74 74 48 8b 4d d0 44 89 ea 44 89 e6
Oct 11 20:19:26 vms kernel: RSP: 0018:ffffb1be08a0f9f8 EFLAGS: 00010213
Oct 11 20:19:26 vms kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Oct 11 20:19:26 vms kernel: RDX: 00000000022b0c12 RSI: 00000000006080c0 RDI: ffff961d97449200
Oct 11 20:19:26 vms kernel: RBP: ffffb1be08a0fa30 R08: ffff961d9f4eda20 R09: ffff961b24820720
Oct 11 20:19:26 vms kernel: R10: ffff961b8cc0abc0 R11: ffffffffa41ecf58 R12: 00000000006080c0
Oct 11 20:19:26 vms kernel: R13: 0000000000000000 R14: ffff961d97449200 R15: ffff961d97449200
Oct 11 20:19:26 vms kernel: FS:  00007f2a8ffff700(0000) GS:ffff961d9f4c0000(0000) knlGS:0000000000000000
Oct 11 20:19:26 vms kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 20:19:26 vms kernel: CR2: 0000000000000000 CR3: 00000017d3928005 CR4: 00000000000226e0
Oct 11 20:19:26 vms kernel: Call Trace:
Oct 11 20:19:26 vms kernel:  ? kmem_cache_alloc+0x15f/0x1d0
Oct 11 20:19:26 vms kernel:  ? ovs_flow_alloc+0x51/0x90 [openvswitch]
Oct 11 20:19:26 vms kernel:  ovs_flow_alloc+0x51/0x90 [openvswitch]
Oct 11 20:19:26 vms kernel:  ovs_packet_cmd_execute+0xdb/0x2a0 [openvswitch]
Oct 11 20:19:26 vms kernel:  genl_family_rcv_msg+0x1d8/0x410
Oct 11 20:19:26 vms kernel:  ? do_sys_poll+0x313/0x530
Oct 11 20:19:26 vms kernel:  genl_rcv_msg+0x4c/0xa0
Oct 11 20:19:26 vms kernel:  ? _cond_resched+0x19/0x30
Oct 11 20:19:26 vms kernel:  ? genl_family_rcv_msg+0x410/0x410
Oct 11 20:19:26 vms kernel:  netlink_rcv_skb+0x4f/0x120
Oct 11 20:19:26 vms kernel:  genl_rcv+0x28/0x40
Oct 11 20:19:26 vms kernel:  netlink_unicast+0x199/0x230
Oct 11 20:19:26 vms kernel:  netlink_sendmsg+0x20d/0x3c0
Oct 11 20:19:26 vms kernel:  sock_sendmsg+0x3e/0x50
Oct 11 20:19:26 vms kernel:  ___sys_sendmsg+0x295/0x2f0
Oct 11 20:19:26 vms kernel:  ? sock_poll+0x69/0xb0
Oct 11 20:19:26 vms kernel:  ? ep_send_events_proc+0xef/0x1f0
Oct 11 20:19:26 vms kernel:  ? ep_read_events_proc+0xd0/0xd0
Oct 11 20:19:26 vms kernel:  ? ep_scan_ready_list.constprop.23+0x1f0/0x200
Oct 11 20:19:26 vms kernel:  ? ep_poll+0x8b/0x450
Oct 11 20:19:26 vms kernel:  ? __fget_light+0x54/0x60
Oct 11 20:19:26 vms kernel:  __sys_sendmsg+0x5c/0xa0
Oct 11 20:19:26 vms kernel:  __x64_sys_sendmsg+0x1f/0x30
Oct 11 20:19:26 vms kernel:  do_syscall_64+0x5a/0x110
Oct 11 20:19:26 vms kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 11 20:19:26 vms kernel: RIP: 0033:0x7f2a96bd3467
Oct 11 20:19:26 vms kernel: Code: 44 00 00 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 3b ed ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 74 ed ff ff 48
Oct 11 20:19:26 vms kernel: RSP: 002b:00007f2a8ffa1100 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
Oct 11 20:19:26 vms kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2a96bd3467
Oct 11 20:19:26 vms kernel: RDX: 0000000000000000 RSI: 00007f2a8ffa1190 RDI: 0000000000000003
Oct 11 20:19:26 vms kernel: RBP: 00007f2a8ffa1190 R08: 0000000000000000 R09: 00007f2a8ffa2a10
Oct 11 20:19:26 vms kernel: R10: 00000000634f5d00 R11: 0000000000000293 R12: 0000000000000000
Oct 11 20:19:26 vms kernel: R13: 00007f2a8ffa29b8 R14: 00007f2a8ffa1630 R15: 00007f2a8ffa1190
Oct 11 20:19:26 vms kernel: Modules linked in: bluetooth ecdh_generic tcp_diag inet_diag dm_snapshot nfsv3 nfs_acl nfs lockd grace fscache xt_TCPMSS xt_tcpmss xt_policy ipt_MASQUERADE iptable_mangle iptable_nat binfmt_misc veth ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_mac ipt_REJECT nf_reject_ipv4 xt_NFLOG xt_limit xt_physdev xt_addrtype xt_multiport xt_conntrack xt_comment xt_tcpudp xt_set xt_mark ip_set_hash_net ip_set iptable_filter bpfilter openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log intel_powerclamp kvm_intel kvm snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm aesni_intel snd_hda_intel drm_kms_helper snd_hda_codec zfs(PO) aes_x86_64 drm crypto_simd zunicode(PO) snd_hda_core cryptd i2c_algo_bit ipmi_si snd_hwdep glue_helper fb_sys_fops zlua(PO) syscopyarea ipmi_devintf snd_pcm sysfillrect ipmi_msghandler sysimgblt intel_cstate pcspkr serio_raw joydev
Oct 11 20:19:26 vms kernel:  input_leds snd_timer snd zcommon(PO) soundcore znvpair(PO) ioatdma zavl(PO) i7core_edac dca icp(PO) spl(O) vhost_net mac_hid vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83795 i5500_temp coretemp nfnetlink_queue nfnetlink vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 sunrpc vfio ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbmouse usbkbd usbhid hid gpio_ich ahci mpt3sas psmouse raid_class i2c_i801 libahci lpc_ich e1000e scsi_transport_sas
Oct 11 20:19:26 vms kernel: CR2: 0000000000000000
Oct 11 20:19:26 vms kernel: ---[ end trace 7b2a753239475296 ]---
Oct 11 20:19:26 vms kernel: RIP: 0010:kmem_cache_alloc_node+0x84/0x200
Oct 11 20:19:26 vms kernel: Code: 89 01 00 00 4d 8b 07 65 49 8b 50 08 65 4c 03 05 4a 07 39 5d 4d 8b 10 4d 85 d2 74 1a 41 83 fd ff 0f 84 84 00 00 00 49 8b 40 10 <48> 8b 00 48 c1 e8 36 41 39 c5 74 74 48 8b 4d d0 44 89 ea 44 89 e6
Oct 11 20:19:26 vms kernel: RSP: 0018:ffffb1be08a0f9f8 EFLAGS: 00010213
Oct 11 20:19:26 vms kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Oct 11 20:19:26 vms kernel: RDX: 00000000022b0c12 RSI: 00000000006080c0 RDI: ffff961d97449200
Oct 11 20:19:26 vms kernel: RBP: ffffb1be08a0fa30 R08: ffff961d9f4eda20 R09: ffff961b24820720
Oct 11 20:19:26 vms kernel: R10: ffff961b8cc0abc0 R11: ffffffffa41ecf58 R12: 00000000006080c0
Oct 11 20:19:26 vms kernel: R13: 0000000000000000 R14: ffff961d97449200 R15: ffff961d97449200
Oct 11 20:19:26 vms kernel: FS:  00007f2a8ffff700(0000) GS:ffff961d9f4c0000(0000) knlGS:0000000000000000
Oct 11 20:19:26 vms kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 20:19:26 vms kernel: CR2: 0000000000000000 CR3: 00000017d3928005 CR4: 00000000000226e0

Here is my /etc/network/interfaces

Code:
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bonds enp3s0 enp4s0
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_options bond_mode=balance-slb lacp=active

auto lo
iface lo inet loopback

iface enp3s0 inet manual

iface enp4s0 inet manual

allow-vmbr0 vlan8int
iface vlan8int inet static
    address  9x.xxx.xxx.x1
    netmask  255.255.255.224
    gateway  9x.xx.xx.x3
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=8

iface vlan8int inet6 static
    address  2xxx:xxxx:x::x1
    netmask  112
    gateway  2xxx:xxxx:x::x3

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 vlan8int

Running service openvswitch-switch restart temporarily fixes the problem, but after some time it happens again. Could someone give me some advice on this?
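For anyone who wants to check whether their own logs show the same signature, here is a rough Python sketch. It assumes kernel log lines in the format shown above; the function name find_ovs_oopses is made up for illustration. It only reports NULL pointer oopses whose call trace contains an [openvswitch] frame, and it handles both the 5.0-style and the 5.3-style wording of the BUG line.

```python
import re

# Matches both wordings of the BUG line:
#   "BUG: unable to handle kernel NULL pointer dereference at ..."
#   "BUG: kernel NULL pointer dereference, address: ..."
BUG_RE = re.compile(r"BUG: (?:unable to handle )?kernel NULL pointer dereference")
# Kernel-mode RIP only (code segment 0010); the user-mode RIP uses 0033.
RIP_RE = re.compile(r"RIP: 0010:(\S+)")
END_RE = re.compile(r"---\[ end trace")


def find_ovs_oopses(lines):
    """Return the faulting symbol of each oops that has an [openvswitch] frame."""
    hits = []
    in_oops = False
    rip = None
    saw_ovs = False
    for line in lines:
        if BUG_RE.search(line):
            # A new oops starts here; reset the per-oops state.
            in_oops, rip, saw_ovs = True, None, False
            continue
        if not in_oops:
            continue
        match = RIP_RE.search(line)
        if match and rip is None:
            rip = match.group(1)
        if "[openvswitch]" in line:
            saw_ovs = True
        if END_RE.search(line):
            if saw_ovs:
                hits.append(rip)
            in_oops = False
    return hits
```

One way to use it is to save the kernel log to a file (for example the output of journalctl -k --no-pager) and pass the lines of that file to the function.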
 
Downgrading to openvswitch 2.6.2 from Debian Stretch looks like a workaround.
 
Maybe you could try kernel 5.3 from the Proxmox 6 pvetest repository?
The bug appears very randomly and in a production environment. I don't know how to manually trigger it, so I can't do such tests.
 
We are seeing the same issue with the following software versions. Has the cause of this issue been found? We see it on PVE nodes connected to Cumulus switches, but not on others.

root@hyperpod3:/var/log# dpkg -l openvswitch-switch
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==================-===============================================-============-===================================
ii openvswitch-switch 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1 amd64 Open vSwitch switch implementations

root@hyperpod3:/var/log# uname -a
Linux hyperpod3 5.3.13-1-pve #1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100) x86_64 GNU/Linux
 
Continuing to see this:

Feb 17 12:06:37 hyperpod2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Feb 17 12:06:37 hyperpod2 kernel: #PF: supervisor read access in kernel mode
Feb 17 12:06:37 hyperpod2 kernel: #PF: error_code(0x0000) - not-present page
Feb 17 12:06:37 hyperpod2 kernel: PGD 0 P4D 0
Feb 17 12:06:37 hyperpod2 kernel: Oops: 0000 [#1] SMP NOPTI
Feb 17 12:06:37 hyperpod2 kernel: CPU: 12 PID: 1547 Comm: handler40 Tainted: P O 5.3.18-1-pve #1
Feb 17 12:06:37 hyperpod2 kernel: Hardware name: /07YXFK, BIOS 1.11.4 09/26/2019
Feb 17 12:06:37 hyperpod2 kernel: RIP: 0010:kmem_cache_alloc_node+0x81/0x260
Feb 17 12:06:37 hyperpod2 kernel: Code: e4 01 00 00 4d 8b 07 65 49 8b 50 08 65 4c 03 05 bd b8 97 7a 4d 8b 10 4d 85 d2 74 1e 41 83 fd ff 0f 84 96 00 00 00 49 8b 40 10 <48> 8b 00 48 c1 e8 36 41 39 c5 0f 84 82 00 00 00 48 8b 4d d0 44 89
Feb 17 12:06:37 hyperpod2 kernel: RSP: 0018:ffffb76b421d79a8 EFLAGS: 00010213
Feb 17 12:06:37 hyperpod2 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Feb 17 12:06:37 hyperpod2 kernel: RDX: 00000000000073c2 RSI: 0000000000000dc0 RDI: ffff8da657bbe140
Feb 17 12:06:37 hyperpod2 kernel: RBP: ffffb76b421d79e0 R08: ffff8da65f0f2140 R09: ffff8da040008000
Feb 17 12:06:37 hyperpod2 kernel: R10: ffff8da077024700 R11: ffffffffc090e530 R12: 0000000000000dc0
Feb 17 12:06:37 hyperpod2 kernel: R13: 0000000000000000 R14: ffff8da657bbe140 R15: ffff8da657bbe140
Feb 17 12:06:37 hyperpod2 kernel: FS: 00007f4dccb4d700(0000) GS:ffff8da65f0c0000(0000) knlGS:0000000000000000
Feb 17 12:06:37 hyperpod2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 12:06:37 hyperpod2 kernel: CR2: 0000000000000000 CR3: 00000007fc3ac000 CR4: 00000000003406e0
Feb 17 12:06:37 hyperpod2 kernel: Call Trace:
Feb 17 12:06:37 hyperpod2 kernel: ? kmem_cache_alloc+0x16a/0x220
Feb 17 12:06:37 hyperpod2 kernel: ? ovs_flow_alloc+0x51/0x90 [openvswitch]
Feb 17 12:06:37 hyperpod2 kernel: ovs_flow_alloc+0x51/0x90 [openvswitch]
Feb 17 12:06:37 hyperpod2 kernel: ovs_packet_cmd_execute+0xdb/0x2a0 [openvswitch]
Feb 17 12:06:37 hyperpod2 kernel: genl_family_rcv_msg+0x1e3/0x480
Feb 17 12:06:37 hyperpod2 kernel: ? __netlink_sendskb+0x3f/0x50
Feb 17 12:06:37 hyperpod2 kernel: genl_rcv_msg+0x4c/0xa0
Feb 17 12:06:37 hyperpod2 kernel: ? genl_family_rcv_msg+0x480/0x480
Feb 17 12:06:37 hyperpod2 kernel: ? genl_family_rcv_msg+0x480/0x480
Feb 17 12:06:37 hyperpod2 kernel: netlink_rcv_skb+0x4f/0x120
Feb 17 12:06:37 hyperpod2 kernel: genl_rcv+0x28/0x40
Feb 17 12:06:37 hyperpod2 kernel: netlink_unicast+0x197/0x220
Feb 17 12:06:37 hyperpod2 kernel: netlink_sendmsg+0x227/0x3d0
Feb 17 12:06:37 hyperpod2 kernel: sock_sendmsg+0x63/0x70
Feb 17 12:06:37 hyperpod2 kernel: ____sys_sendmsg+0x1fa/0x270
Feb 17 12:06:37 hyperpod2 kernel: ? copy_msghdr_from_user+0xd5/0x150
Feb 17 12:06:37 hyperpod2 kernel: ___sys_sendmsg+0x7c/0xc0
Feb 17 12:06:37 hyperpod2 kernel: ? ___sys_recvmsg+0x87/0xc0
Feb 17 12:06:37 hyperpod2 kernel: ? ep_poll+0x88/0x420
Feb 17 12:06:37 hyperpod2 kernel: ? __fget_light+0x59/0x70
Feb 17 12:06:37 hyperpod2 kernel: __sys_sendmsg+0x5c/0xa0
Feb 17 12:06:37 hyperpod2 kernel: __x64_sys_sendmsg+0x1f/0x30
Feb 17 12:06:37 hyperpod2 kernel: do_syscall_64+0x5a/0x130
Feb 17 12:06:37 hyperpod2 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 17 12:06:37 hyperpod2 kernel: RIP: 0033:0x7f4dce237467
Feb 17 12:06:37 hyperpod2 kernel: Code: 44 00 00 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 3b ed ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 74 ed ff ff 48
Feb 17 12:06:37 hyperpod2 kernel: RSP: 002b:00007f4dccaef100 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
Feb 17 12:06:37 hyperpod2 kernel: RAX: ffffffffffffffda RBX: 0000000000000103 RCX: 00007f4dce237467
Feb 17 12:06:37 hyperpod2 kernel: RDX: 0000000000000000 RSI: 00007f4dccaef190 RDI: 0000000000000103
Feb 17 12:06:37 hyperpod2 kernel: RBP: 00007f4dccaef190 R08: 0000000000000000 R09: 00007f4dccaf3508
Feb 17 12:06:37 hyperpod2 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Feb 17 12:06:37 hyperpod2 kernel: R13: 00007f4dccaf34b0 R14: 00007f4dccaef640 R15: 00007f4dccaef190
Feb 17 12:06:37 hyperpod2 kernel: Modules linked in: dm_crypt algif_skcipher af_alg ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink amd64_edac_mod edac_mce_amd ipmi_ssif kvm_amd kvm irqbypass zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel mgag200 aesni_intel drm_vram_helper aes_x86_64 ttm crypto_simd drm_kms_helper cryptd glue_helper drm i2c_algo_bit pcspkr fb_sys_fops syscopyarea sysfillrect sysimgblt ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid zcommon(PO) znvpair(PO) spl(O) sunrpc vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c mlx5_ib ib_uverbs ib_core tg3 ahci i2c_piix4 libahci
Feb 17 12:06:37 hyperpod2 kernel: mlx5_core tls mlxfw
Feb 17 12:06:37 hyperpod2 kernel: CR2: 0000000000000000
Feb 17 12:06:37 hyperpod2 kernel: ---[ end trace 034a76fbbd78f617 ]---
 
We've installed these two packages:
http://odisoweb1.odiso.net/openvswitch-common_2.12.0-1_amd64.deb
http://odisoweb1.odiso.net/openvswitch-switch_2.12.0-1_amd64.deb

It worked for two days and then one of the nodes showed the same error:

Feb 23 15:30:01 hyperpod2 systemd[1]: Started Proxmox VE replication runner.
Feb 23 15:30:42 hyperpod2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Feb 23 15:30:42 hyperpod2 kernel: #PF: supervisor read access in kernel mode
Feb 23 15:30:42 hyperpod2 kernel: #PF: error_code(0x0000) - not-present page
Feb 23 15:30:42 hyperpod2 kernel: PGD 0 P4D 0
Feb 23 15:30:42 hyperpod2 kernel: Oops: 0000 [#1] SMP NOPTI
Feb 23 15:30:42 hyperpod2 kernel: CPU: 27 PID: 1608 Comm: handler54 Tainted: P O 5.3.18-1-pve #1
Feb 23 15:30:42 hyperpod2 kernel: Hardware name: /07YXFK, BIOS 1.11.4 09/26/2019
Feb 23 15:30:42 hyperpod2 kernel: RIP: 0010:kmem_cache_alloc_node+0x81/0x260
Feb 23 15:30:42 hyperpod2 kernel: Code: e4 01 00 00 4d 8b 07 65 49 8b 50 08 65 4c 03 05 bd b8 37 62 4d 8b 10 4d 85 d2 74 1e 41 83 fd ff 0f 84 96 00 00 00 49 8b 40 10 <48> 8b 00 48 c1 e8 36 41 39 c5 0f 84 82 00 00 00 48 8b 4d d0 44 89
Feb 23 15:30:42 hyperpod2 kernel: RSP: 0018:ffffaa32825e79a8 EFLAGS: 00010213
Feb 23 15:30:42 hyperpod2 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Feb 23 15:30:42 hyperpod2 kernel: RDX: 00000000000120af RSI: 0000000000000dc0 RDI: ffff9832d7bbebc0
Feb 23 15:30:42 hyperpod2 kernel: RBP: ffffaa32825e79e0 R08: ffff9832dd3b2140 R09: ffff982bd2362980
Feb 23 15:30:42 hyperpod2 kernel: R10: ffff982c3be60600 R11: ffffffffc07d4530 R12: 0000000000000dc0
Feb 23 15:30:42 hyperpod2 kernel: R13: 0000000000000000 R14: ffff9832d7bbebc0 R15: ffff9832d7bbebc0
Feb 23 15:30:42 hyperpod2 kernel: FS: 00007fa679ffb700(0000) GS:ffff9832dd380000(0000) knlGS:0000000000000000
Feb 23 15:30:42 hyperpod2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 23 15:30:42 hyperpod2 kernel: CR2: 0000000000000000 CR3: 00000007fce96000 CR4: 00000000003406e0
Feb 23 15:30:42 hyperpod2 kernel: Call Trace:
Feb 23 15:30:42 hyperpod2 kernel: ? kmem_cache_alloc+0x16a/0x220
Feb 23 15:30:42 hyperpod2 kernel: ? ovs_flow_alloc+0x51/0x90 [openvswitch]
Feb 23 15:30:42 hyperpod2 kernel: ovs_flow_alloc+0x51/0x90 [openvswitch]
Feb 23 15:30:42 hyperpod2 kernel: ovs_packet_cmd_execute+0xdb/0x2a0 [openvswitch]
Feb 23 15:30:42 hyperpod2 kernel: genl_family_rcv_msg+0x1e3/0x480
Feb 23 15:30:42 hyperpod2 kernel: ? fput+0x13/0x15
Feb 23 15:30:42 hyperpod2 kernel: genl_rcv_msg+0x4c/0xa0
Feb 23 15:30:42 hyperpod2 kernel: ? genl_family_rcv_msg+0x480/0x480
Feb 23 15:30:42 hyperpod2 kernel: netlink_rcv_skb+0x4f/0x120
Feb 23 15:30:42 hyperpod2 kernel: genl_rcv+0x28/0x40
Feb 23 15:30:42 hyperpod2 kernel: netlink_unicast+0x197/0x220
Feb 23 15:30:42 hyperpod2 kernel: netlink_sendmsg+0x227/0x3d0
Feb 23 15:30:42 hyperpod2 kernel: sock_sendmsg+0x63/0x70
Feb 23 15:30:42 hyperpod2 kernel: ____sys_sendmsg+0x1fa/0x270
Feb 23 15:30:42 hyperpod2 kernel: ? copy_msghdr_from_user+0xd5/0x150
Feb 23 15:30:42 hyperpod2 kernel: ___sys_sendmsg+0x7c/0xc0
Feb 23 15:30:42 hyperpod2 kernel: ? ep_item_poll.isra.20+0x44/0xc0
Feb 23 15:30:42 hyperpod2 kernel: ? ep_send_events_proc+0xef/0x1f0
Feb 23 15:30:42 hyperpod2 kernel: ? ep_read_events_proc+0xd0/0xd0
Feb 23 15:30:42 hyperpod2 kernel: ? ep_scan_ready_list.constprop.24+0x20d/0x220
Feb 23 15:30:42 hyperpod2 kernel: ? ep_poll+0x88/0x420
Feb 23 15:30:42 hyperpod2 kernel: ? __fget_light+0x59/0x70
Feb 23 15:30:42 hyperpod2 kernel: __sys_sendmsg+0x5c/0xa0
Feb 23 15:30:42 hyperpod2 kernel: __x64_sys_sendmsg+0x1f/0x30
Feb 23 15:30:42 hyperpod2 kernel: do_syscall_64+0x5a/0x130
Feb 23 15:30:42 hyperpod2 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 23 15:30:42 hyperpod2 kernel: RIP: 0033:0x7fa69c1c0467
Feb 23 15:30:42 hyperpod2 kernel: Code: 44 00 00 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 3b ed ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 74 ed ff ff 48
Feb 23 15:30:42 hyperpod2 kernel: RSP: 002b:00007fa679f9bed0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
Feb 23 15:30:42 hyperpod2 kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007fa69c1c0467
Feb 23 15:30:42 hyperpod2 kernel: RDX: 0000000000000000 RSI: 00007fa679f9bf60 RDI: 0000000000000007
Feb 23 15:30:42 hyperpod2 kernel: RBP: 00007fa679f9bf60 R08: 0000000000000000 R09: 00007fa64800b2a0
Feb 23 15:30:42 hyperpod2 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Feb 23 15:30:42 hyperpod2 kernel: R13: 00007fa679f9d788 R14: 00007fa679f9c400 R15: 00007fa679f9bf60
Feb 23 15:30:42 hyperpod2 kernel: Modules linked in: dm_snapshot tcp_diag inet_diag dm_crypt algif_skcipher af_alg ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass ipmi_ssif zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel mgag200 aes_x86_64 drm_vram_helper crypto_simd cryptd ttm glue_helper drm_kms_helper joydev input_leds pcspkr drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c mlx5_ib
Feb 23 15:30:42 hyperpod2 kernel: ib_uverbs ib_core hid_generic usbmouse usbkbd usbhid tg3 hid ahci libahci mlx5_core i2c_piix4 tls mlxfw
Feb 23 15:30:42 hyperpod2 kernel: CR2: 0000000000000000
Feb 23 15:30:42 hyperpod2 kernel: ---[ end trace bed30caa5a6b348c ]---
Feb 23 15:30:42 hyperpod2 kernel: RIP: 0010:kmem_cache_alloc_node+0x81/0x260
Feb 23 15:30:42 hyperpod2 kernel: Code: e4 01 00 00 4d 8b 07 65 49 8b 50 08 65 4c 03 05 bd b8 37 62 4d 8b 10 4d 85 d2 74 1e 41 83 fd ff 0f 84 96 00 00 00 49 8b 40 10 <48> 8b 00 48 c1 e8 36 41 39 c5 0f 84 82 00 00 00 48 8b 4d d0 44 89
Feb 23 15:30:42 hyperpod2 kernel: RSP: 0018:ffffaa32825e79a8 EFLAGS: 00010213
Feb 23 15:30:42 hyperpod2 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Feb 23 15:30:42 hyperpod2 kernel: RDX: 00000000000120af RSI: 0000000000000dc0 RDI: ffff9832d7bbebc0
Feb 23 15:30:42 hyperpod2 kernel: RBP: ffffaa32825e79e0 R08: ffff9832dd3b2140 R09: ffff982bd2362980
Feb 23 15:30:42 hyperpod2 kernel: R10: ffff982c3be60600 R11: ffffffffc07d4530 R12: 0000000000000dc0
Feb 23 15:30:42 hyperpod2 kernel: R13: 0000000000000000 R14: ffff9832d7bbebc0 R15: ffff9832d7bbebc0
Feb 23 15:30:42 hyperpod2 kernel: FS: 00007fa679ffb700(0000) GS:ffff9832dd380000(0000) knlGS:0000000000000000
Feb 23 15:30:42 hyperpod2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 23 15:30:42 hyperpod2 kernel: CR2: 0000000000000000 CR3: 00000007fce96000 CR4: 00000000003406e0
Feb 23 15:30:51 hyperpod2 corosync[2196]: [KNET ] link: host: 3 link: 0 is down
Feb 23 15:30:51 hyperpod2 corosync[2196]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Feb 23 15:30:51 hyperpod2 corosync[2196]: [TOTEM ] Retransmit List: c7c22
Feb 23 15:30:51 hyperpod2 corosync[2196]: [TOTEM ] Retransmit List: c7c22 c7c24

The node lost connectivity over ovs interfaces. Cluster communication switched to the redundant link 1 that doesn't use ovs. Eventually I restarted openvswitch-switch.service, and all interfaces started working again.
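Since restarting openvswitch-switch.service brought the interfaces back, one possible stopgap is a small check that restarts the service when the oops shows up in the kernel log. Below is a rough Python sketch of that idea, not a fix for the underlying bug. It assumes journalctl and systemctl are available on the node; the function names and the 5 minute window are made up for illustration.

```python
import subprocess

SIGNATURE = "ovs_flow_alloc"  # frame that appears in every trace in this thread
SERVICE = "openvswitch-switch.service"


def recent_kernel_log(window="5 minutes ago"):
    """Return recent kernel messages as text (assumes systemd's journalctl)."""
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager", "--since", window],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def needs_restart(log_text):
    """True if the log shows a NULL pointer oops that mentions ovs_flow_alloc."""
    return "NULL pointer dereference" in log_text and SIGNATURE in log_text


def check_and_restart(read_log=recent_kernel_log, run=subprocess.run):
    """Restart the service if the signature is present; return whether we did."""
    if not needs_restart(read_log()):
        return False
    run(["systemctl", "restart", SERVICE], check=False)
    return True
```

If something like this is run periodically, the interval should be at least as long as the log window, otherwise the same oops can trigger more than one restart.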
 
/etc/network/interfaces:

auto lo
iface lo inet loopback
iface eno1 inet manual
iface ens3f0 inet manual
iface ens3f1 inet manual

# Private VLAN 13 for RRP
auto vmbr1
iface vmbr1 inet static
address 10.10.13.107
netmask 24
bridge-ports eno1
bridge-stp off
bridge-fd 0
#RRP

# Bond ens3f0 and ens3f1 together
allow-vmbr0 bond0
iface bond0 inet manual
ovs_bonds ens3f0 ens3f1
ovs_type OVSBond
ovs_bridge vmbr0
ovs_options bond_mode=balance-tcp lacp=active

allow-ovs vmbr0

auto vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
ovs_ports bond0 vlan16 vlan20 vlan60
# NOTE: we MUST mention bond0, vlans even though each
# of them lists ovs_bridge vmbr0! Not sure why it needs this
# kind of cross-referencing but it won't work without it!

# vlan to access the host
allow-vmbr0 vlan60
iface vlan60 inet static
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=60
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 192.168.60.107
netmask 255.255.255.0
gateway 192.168.60.1
mtu 1500


# Proxmox cluster communication vlan
allow-vmbr0 vlan16
iface vlan16 inet static
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=16
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 10.10.15.107
netmask 255.255.255.0
mtu 1500

# Ceph cluster communication vlan
allow-vmbr0 vlan20
iface vlan20 inet static
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=20
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 10.10.20.107
netmask 255.255.255.0
mtu 1500

Thank you for looking into this.
 
FWIW I am also getting occasional openvswitch crashes similar in nature since the upgrade to 6. Apparently there's no watchdog reboot enabled; system just locks up with kernel logs dumped.
 
Hi guys,

Could you tell us which exact NIC model you use with ovs?
(lspci -nn)

I wonder whether the problem might be unrelated to ovs. (I have seen Proxmox users report bugs with the Intel e1000e driver, for example.)
 
01:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
Thanks for reporting. I have a lot of ConnectX-4/5 cards in production, and I don't have any problems with Linux bridge or with their drivers.
So it's 100% an ovs bug here.

openvswitch 2.12 is now available in the pve-no-subscription repo. Has anybody tested it already?
 
We've upgraded RAM on our servers from 32G to 128G. We haven't seen the crash for a week now.
 
We've tested the 2.12 packages from http://odisoweb1.odiso.net that spirit mentioned earlier, but eventually got the same error.
The 2.12 packages in pve-no-subscription are from the same source, so no luck there.
It could really be an ovs kernel module problem, but I really can't reproduce it here :/
 
