NIC flaps on all nodes in the cluster at the same time

Hello,

We have a Proxmox cluster with six nodes, using local storage and no HA or auto-failover features.
We recently had a DDoS attack aimed at a VM on one of these nodes.
The attack lasted several minutes, and during that time the NICs on all nodes in the cluster became unstable, even on nodes that were not targeted.
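For context: each node has a dual-port Intel 25G NIC using the ice driver (both ports on PCI device 41:00), with port 0 in bond0 and port 1 in bond1. Below is a simplified sketch of the relevant part of our /etc/network/interfaces; the interface names come from the logs, but only the slaves visible there are shown, and the bond options and addresses are illustrative rather than our exact config:
Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0np0
        bond-mode active-backup
        bond-miimon 100

auto bond1
iface bond1 inet manual
        bond-slaves enp65s0f1np1
        bond-mode active-backup
        bond-miimon 100

# vmbr0 carries management and VM traffic on top of bond0 (illustrative addressing)
auto vmbr0
iface vmbr0 inet static
        address 192.0.2.11/24
        gateway 192.0.2.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0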

This is what we see in the logs of a node that was not targeted by the DDoS but still had a flapping NIC:
Code:
2024-05-16T21:19:58.847622+02:00 vmh001 corosync[3916]:   [KNET  ] link: host: 5 link: 0 is down
2024-05-16T21:19:58.847789+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
2024-05-16T21:19:58.847809+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 5 has no active links
2024-05-16T21:20:02.927030+02:00 vmh001 corosync[3916]:   [TOTEM ] Token has not been received in 4200 ms
2024-05-16T21:20:04.327044+02:00 vmh001 corosync[3916]:   [TOTEM ] A processor failed, forming new configuration: token timed out (5600ms), waiting 6720ms for consensus.
2024-05-16T21:20:11.062124+02:00 vmh001 corosync[3916]:   [QUORUM] Sync members[5]: 1 2 3 4 6
2024-05-16T21:20:11.062347+02:00 vmh001 corosync[3916]:   [QUORUM] Sync left[1]: 5
2024-05-16T21:20:11.062369+02:00 vmh001 corosync[3916]:   [TOTEM ] A new membership (1.252) was formed. Members left: 5
2024-05-16T21:20:11.062391+02:00 vmh001 corosync[3916]:   [TOTEM ] Failed to receive the leave message. failed: 5
2024-05-16T21:20:11.086922+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: members: 1/3947, 2/3928, 3/3923, 4/3869, 6/3810
2024-05-16T21:20:11.087015+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: starting data syncronisation
2024-05-16T21:20:11.087068+02:00 vmh001 pmxcfs[3810]: [status] notice: members: 1/3947, 2/3928, 3/3923, 4/3869, 6/3810
2024-05-16T21:20:11.087093+02:00 vmh001 pmxcfs[3810]: [status] notice: starting data syncronisation
2024-05-16T21:20:11.095110+02:00 vmh001 corosync[3916]:   [QUORUM] Members[5]: 1 2 3 4 6
2024-05-16T21:20:11.095166+02:00 vmh001 corosync[3916]:   [MAIN  ] Completed service synchronization, ready to provide service.
2024-05-16T21:20:11.188180+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: received sync request (epoch 1/3947/0000000A)
2024-05-16T21:20:11.189120+02:00 vmh001 pmxcfs[3810]: [status] notice: received sync request (epoch 1/3947/0000000A)
2024-05-16T21:20:11.193177+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: received all states
2024-05-16T21:20:11.193210+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: leader is 1/3947
2024-05-16T21:20:11.193230+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: synced members: 1/3947, 2/3928, 3/3923, 4/3869, 6/3810
2024-05-16T21:20:11.193248+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: all data is up to date
2024-05-16T21:20:11.193288+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: dfsm_deliver_queue: queue length 10
2024-05-16T21:20:11.205968+02:00 vmh001 pmxcfs[3810]: [status] notice: received all states
2024-05-16T21:20:11.206294+02:00 vmh001 pmxcfs[3810]: [status] notice: all data is up to date
2024-05-16T21:20:11.206317+02:00 vmh001 pmxcfs[3810]: [status] notice: dfsm_deliver_queue: queue length 233
2024-05-16T21:20:12.212217+02:00 vmh001 pvescheduler[3762571]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
2024-05-16T21:20:17.052760+02:00 vmh001 corosync[3916]:   [KNET  ] rx: host: 5 link: 0 is up
2024-05-16T21:20:17.052905+02:00 vmh001 corosync[3916]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
2024-05-16T21:20:17.052946+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
2024-05-16T21:20:17.068350+02:00 vmh001 corosync[3916]:   [KNET  ] pmtud: Global data MTU changed to: 1397
2024-05-16T21:20:17.766114+02:00 vmh001 corosync[3916]:   [QUORUM] Sync members[6]: 1 2 3 4 5 6
2024-05-16T21:20:17.766231+02:00 vmh001 corosync[3916]:   [QUORUM] Sync joined[1]: 5
2024-05-16T21:20:17.766252+02:00 vmh001 corosync[3916]:   [TOTEM ] A new membership (1.256) was formed. Members joined: 5
2024-05-16T21:20:17.770415+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: members: 1/3947, 2/3928, 3/3923, 4/3869, 5/3854, 6/3810
2024-05-16T21:20:17.770458+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: starting data syncronisation
2024-05-16T21:20:17.770493+02:00 vmh001 pmxcfs[3810]: [status] notice: members: 1/3947, 2/3928, 3/3923, 4/3869, 5/3854, 6/3810
2024-05-16T21:20:17.770531+02:00 vmh001 pmxcfs[3810]: [status] notice: starting data syncronisation
2024-05-16T21:20:17.772826+02:00 vmh001 corosync[3916]:   [QUORUM] Members[6]: 1 2 3 4 5 6
2024-05-16T21:20:17.772856+02:00 vmh001 corosync[3916]:   [MAIN  ] Completed service synchronization, ready to provide service.
2024-05-16T21:20:17.871703+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: received sync request (epoch 1/3947/0000000B)
2024-05-16T21:20:17.872173+02:00 vmh001 pmxcfs[3810]: [status] notice: received sync request (epoch 1/3947/0000000B)
2024-05-16T21:20:17.885625+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: received all states
2024-05-16T21:20:17.885659+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: leader is 1/3947
2024-05-16T21:20:17.885697+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: synced members: 1/3947, 2/3928, 3/3923, 4/3869, 6/3810
2024-05-16T21:20:17.885715+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: all data is up to date
2024-05-16T21:20:17.891158+02:00 vmh001 pmxcfs[3810]: [status] notice: received all states
2024-05-16T21:20:17.891434+02:00 vmh001 pmxcfs[3810]: [status] notice: all data is up to date
2024-05-16T21:20:17.891451+02:00 vmh001 pmxcfs[3810]: [status] notice: dfsm_deliver_queue: queue length 17
2024-05-16T21:20:25.454317+02:00 vmh001 corosync[3916]:   [KNET  ] link: host: 5 link: 0 is down
2024-05-16T21:20:25.454473+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
2024-05-16T21:20:25.454491+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 5 has no active links
2024-05-16T21:20:28.952811+02:00 vmh001 corosync[3916]:   [TOTEM ] Token has not been received in 4200 ms
2024-05-16T21:20:30.352896+02:00 vmh001 corosync[3916]:   [TOTEM ] A processor failed, forming new configuration: token timed out (5600ms), waiting 6720ms for consensus.
2024-05-16T21:20:36.207821+02:00 vmh001 kernel: [1242621.704689] ice 0000:41:00.0 irdma0: ICE OICR event notification: oicr = 0x04000003
2024-05-16T21:20:36.207879+02:00 vmh001 kernel: [1242621.705070] ice 0000:41:00.0 irdma0: HMC Error
2024-05-16T21:20:36.207881+02:00 vmh001 kernel: [1242621.705401] ice 0000:41:00.0 irdma0: Requesting a reset
2024-05-16T21:20:36.217301+02:00 vmh001 kernel: [1242621.714860] ice 0000:41:00.1 irdma1: ICE OICR event notification: oicr = 0x04000003
2024-05-16T21:20:36.221643+02:00 vmh001 kernel: [1242621.718299] ice 0000:41:00.1 irdma1: HMC Error
2024-05-16T21:20:36.221652+02:00 vmh001 kernel: [1242621.718648] ice 0000:41:00.1 irdma1: Requesting a reset
2024-05-16T21:20:37.237575+02:00 vmh001 kernel: [1242622.734160] ice 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd0608c0 flags=0x0000]
2024-05-16T21:20:37.237599+02:00 vmh001 kernel: [1242622.734470] ice 0000:41:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001b address=0xfd0608c0 flags=0x0000]

<snip> due to character limit, same AMD-Vi message repeats multiple times

2024-05-16T21:20:37.892572+02:00 vmh001 kernel: [1242623.389191] bond1: (slave enp65s0f1np1): link status definitely down, disabling slave
2024-05-16T21:20:37.894521+02:00 vmh001 kernel: [1242623.391200] bond0: (slave enp65s0f0np0): link status definitely down, disabling slave
2024-05-16T21:20:37.913521+02:00 vmh001 kernel: [1242623.410477] ice 0000:41:00.1: PTP reset successful
2024-05-16T21:20:38.387572+02:00 vmh001 kernel: [1242623.884618] ice 0000:41:00.0: PTP reset successful
2024-05-16T21:20:39.458794+02:00 vmh001 corosync[3916]:   [KNET  ] link: host: 1 link: 0 is down
2024-05-16T21:20:39.459017+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2024-05-16T21:20:39.459039+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 1 has no active links
2024-05-16T21:20:40.230563+02:00 vmh001 kernel: [1242625.727252] ice 0000:41:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
2024-05-16T21:20:40.247556+02:00 vmh001 kernel: [1242625.744132] ice 0000:41:00.1: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
2024-05-16T21:20:40.285525+02:00 vmh001 kernel: [1242625.781415] bond1: (slave enp65s0f1np1): link status definitely up, 25000 Mbps full duplex
2024-05-16T21:20:40.285533+02:00 vmh001 kernel: [1242625.782181] bond1: active interface up!
2024-05-16T21:20:40.458534+02:00 vmh001 kernel: [1242625.955374] ice 0000:41:00.1 irdma0: ICE OICR event notification: oicr = 0x04000003
2024-05-16T21:20:40.458548+02:00 vmh001 kernel: [1242625.955626] ice 0000:41:00.1 irdma0: HMC Error
2024-05-16T21:20:40.458549+02:00 vmh001 kernel: [1242625.955811] ice 0000:41:00.1 irdma0: Requesting a reset
2024-05-16T21:20:40.763573+02:00 vmh001 kernel: [1242626.260114] bond1: (slave enp65s0f1np1): link status definitely down, disabling slave
2024-05-16T21:20:40.944571+02:00 vmh001 kernel: [1242626.440797] ice 0000:41:00.1: PTP reset successful
2024-05-16T21:20:42.259604+02:00 vmh001 corosync[3916]:   [KNET  ] rx: host: 1 link: 0 is up
2024-05-16T21:20:42.259789+02:00 vmh001 corosync[3916]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
2024-05-16T21:20:42.259809+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2024-05-16T21:20:42.273590+02:00 vmh001 corosync[3916]:   [KNET  ] pmtud: Global data MTU changed to: 1397
2024-05-16T21:20:42.755543+02:00 vmh001 kernel: [1242628.252034] ice 0000:41:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
2024-05-16T21:20:42.772567+02:00 vmh001 kernel: [1242628.269239] ice 0000:41:00.1: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
2024-05-16T21:20:42.877297+02:00 vmh001 corosync[3916]:   [QUORUM] Sync members[5]: 1 2 3 4 6
2024-05-16T21:20:42.877405+02:00 vmh001 corosync[3916]:   [QUORUM] Sync left[1]: 5
2024-05-16T21:20:42.877423+02:00 vmh001 corosync[3916]:   [TOTEM ] A new membership (1.25e) was formed. Members left: 5
2024-05-16T21:20:42.877444+02:00 vmh001 corosync[3916]:   [TOTEM ] Failed to receive the leave message. failed: 5
2024-05-16T21:20:42.880936+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: members: 1/3947, 2/3928, 3/3923, 4/3869, 6/3810
2024-05-16T21:20:42.881010+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: starting data syncronisation
2024-05-16T21:20:42.881030+02:00 vmh001 pmxcfs[3810]: [status] notice: members: 1/3947, 2/3928, 3/3923, 4/3869, 6/3810
2024-05-16T21:20:42.881048+02:00 vmh001 pmxcfs[3810]: [status] notice: starting data syncronisation
2024-05-16T21:20:42.882789+02:00 vmh001 corosync[3916]:   [QUORUM] Members[5]: 1 2 3 4 6
2024-05-16T21:20:42.882816+02:00 vmh001 corosync[3916]:   [MAIN  ] Completed service synchronization, ready to provide service.
2024-05-16T21:20:42.982431+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: received sync request (epoch 1/3947/0000000C)
2024-05-16T21:20:42.982890+02:00 vmh001 pmxcfs[3810]: [status] notice: received sync request (epoch 1/3947/0000000C)
2024-05-16T21:20:42.995985+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: received all states
2024-05-16T21:20:42.996028+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: leader is 1/3947
2024-05-16T21:20:42.996048+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: synced members: 1/3947, 2/3928, 3/3923, 4/3869, 6/3810
2024-05-16T21:20:42.996065+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: all data is up to date
2024-05-16T21:20:42.996091+02:00 vmh001 pmxcfs[3810]: [dcdb] notice: dfsm_deliver_queue: queue length 5
2024-05-16T21:20:42.997096+02:00 vmh001 pmxcfs[3810]: [status] notice: received all states
2024-05-16T21:20:42.997275+02:00 vmh001 pmxcfs[3810]: [status] notice: all data is up to date
2024-05-16T21:20:42.997292+02:00 vmh001 pmxcfs[3810]: [status] notice: dfsm_deliver_queue: queue length 268
2024-05-16T21:20:43.058818+02:00 vmh001 kernel: [1242628.554898] ice 0000:41:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
2024-05-16T21:20:43.061529+02:00 vmh001 kernel: [1242628.557389] bond1: (slave enp65s0f1np1): link status definitely up, 25000 Mbps full duplex
2024-05-16T21:20:43.086573+02:00 vmh001 kernel: [1242628.582952] ice 0000:41:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
2024-05-16T21:20:43.099529+02:00 vmh001 kernel: [1242628.595287] bond0: (slave enp65s0f0np0): link status definitely up, 25000 Mbps full duplex
2024-05-16T21:20:43.099534+02:00 vmh001 kernel: [1242628.596108] bond0: active interface up!
2024-05-16T21:20:44.941670+02:00 vmh001 kernel: [1242630.438456] ice 0000:41:00.0 irdma1: ICE OICR event notification: oicr = 0x04000003
2024-05-16T21:20:44.941685+02:00 vmh001 kernel: [1242630.438699] ice 0000:41:00.0 irdma1: HMC Error
2024-05-16T21:20:44.941686+02:00 vmh001 kernel: [1242630.438887] ice 0000:41:00.0 irdma1: Requesting a reset
2024-05-16T21:20:45.203909+02:00 vmh001 kernel: [1242630.700022] bond0: (slave enp65s0f0np0): link status definitely down, disabling slave
2024-05-16T21:20:45.566597+02:00 vmh001 kernel: [1242631.062884] ice 0000:41:00.0: PTP reset successful
2024-05-16T21:20:47.860744+02:00 vmh001 corosync[3916]:   [KNET  ] rx: host: 5 link: 0 is up
2024-05-16T21:20:47.860896+02:00 vmh001 corosync[3916]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
2024-05-16T21:20:47.860952+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
2024-05-16T21:20:47.876084+02:00 vmh001 corosync[3916]:   [KNET  ] pmtud: Global data MTU changed to: 1397
2024-05-16T21:20:48.897538+02:00 vmh001 kernel: [1242634.394533] ice 0000:41:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
2024-05-16T21:20:48.913573+02:00 vmh001 kernel: [1242634.410089] ice 0000:41:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
2024-05-16T21:20:48.946537+02:00 vmh001 kernel: [1242634.442554] bond0: (slave enp65s0f0np0): link status definitely up, 25000 Mbps full duplex
2024-05-16T21:20:50.660960+02:00 vmh001 corosync[3916]:   [KNET  ] link: host: 4 link: 0 is down
2024-05-16T21:20:50.661059+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
2024-05-16T21:20:50.661079+02:00 vmh001 corosync[3916]:   [KNET  ] host: host: 4 has no active links
2024-05-16T21:20:54.862099+02:00 vmh001 corosync[3916]:   [KNET  ] rx: host: 4 link: 0 is up
2024-05-16T21:20:54.862276+02:00 vmh001 corosync[3916]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined

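From the kernel messages, the sequence on this node looks like: the irdma/ice driver reports an HMC error and requests a reset, the AMD-Vi IOMMU logs IO_PAGE_FAULT events for the NIC, the ice driver rebuilds the VSIs, and while that happens the bond slaves flap and corosync loses its knet links. For anyone who wants to dig deeper, driver/firmware details and earlier errors can be pulled with something like this (illustrative commands, output omitted here):
Code:
# driver and NVM/firmware version of the two affected ports
ethtool -i enp65s0f0np0
ethtool -i enp65s0f1np1

# check for earlier ice/irdma resets or IOMMU faults
journalctl -k | grep -Ei 'ice|irdma|AMD-Vi'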
Does anyone know what could be the cause?
We are running:
Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
pve-kernel-6.2.16-3-pve: 6.2.16-3
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.11
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
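One thing we are wondering about: we do not use RDMA at all (local storage only), so would blacklisting the irdma module be a reasonable way to at least stop the HMC-error resets? An untested sketch of what we have in mind (happy to be told this is a bad idea):
Code:
# untested idea: disable the RDMA driver for these NICs, since we don't use RDMA
echo "blacklist irdma" > /etc/modprobe.d/blacklist-irdma.conf
update-initramfs -u -k all
# reboot the node afterwards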

Thank you!