Hi Tom. Is there a specific thread that is tracking this issue? I looked at the links and there are a bunch of threads that could cover this. Which one should I be reading?
You're right, it did not really help; the whole cluster crashed and I'm now working late to recover it. I think I might have some problems with the bnx2 driver, since the Ethernet ports keep going down and up.
I've done some searching and added intremap=off to the kernel command line; after rebooting the nodes I'll see how it goes. I'm also considering adding the bnx2 module parameter disable_msi=1.
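For reference, this is roughly how I'm applying both settings on each node (standard Debian/GRUB layout assumed; adjust if your nodes boot via systemd-boot):
Code:
# /etc/default/grub: append intremap=off to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet intremap=off"
# then run: update-grub && reboot

# /etc/modprobe.d/bnx2.conf: disable MSI for the bnx2 driver
options bnx2 disable_msi=1
# then run: update-initramfs -u && reboot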
We had this issue with a disintegrating cluster right after upgrading to PVE 6. It also seemed to be related to things happening in other parts of the network (we had a few VLANs transported over the management links that span the whole network). But PVE 5 did not have this issue even with its pain-in-the-behind multicast.
It also appears that there were two separate problems:
- one involving a corosync crash that did not seem to affect cluster integrity because corosync was restarted automatically (I see that this should not have happened, but our logs say it was restarted automatically); it did, however, randomly disable HA on VMs (the workaround was to manually remove them from HA and add them back, see the sketch after this list)
- another related to corosync failing to maintain cluster integrity. In this case corosync did not crash as far as I know, but the cluster somehow got fragmented and nodes rebooted. The workaround was stopping corosync on all nodes and then starting it again (also sketched below), but good luck with that if you have HA set up.
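Roughly what we ran for the two workarounds (the VM ID is just an example):
Code:
# Workaround 1: kick a VM out of HA and add it back
ha-manager remove vm:100
ha-manager add vm:100

# Workaround 2: stop corosync on every node first, then start it again everywhere
systemctl stop corosync     # run on each node
systemctl start corosync    # run on each node once all have stopped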
We moved the corosync traffic to a separate network (split the 4-link management LACP into 2x2) and have had no issues since.
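For what it's worth, the split looks roughly like this in /etc/network/interfaces (interface names and the address are examples; corosync then uses the address on bond1):
Code:
# bond0: two links for management/VM traffic (was a 4-link LACP bond)
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100

# bond1: two links dedicated to corosync
auto bond1
iface bond1 inet static
    address 10.10.10.11/24
    bond-slaves eno3 eno4
    bond-mode 802.3ad
    bond-miimon 100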
(I know - I should have a separate cluster fabric and 10Gb when I'm also playing with Ceph, but it used to be rock solid with corosync 2.)
Next step would be to rule out any issues with switches and cables.
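Roughly what I plan to check first on each node before blaming the switches (the interface name is just an example):
Code:
dmesg -T | grep -i bnx2                      # driver resets / link flaps
ethtool eno1                                 # negotiated speed and link state
ethtool -S eno1 | grep -iE 'err|drop|crc'    # NIC error counters
ip -s link show eno1                         # RX/TX errors and drops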
I have no Ethernet problems in my logs, so I don't think that is the main problem, but you should fix that anyway.
I also mix corosync with other traffic. Trying to break up the bonds and set up a new network gave me other problems, but that is for another thread in the future.
Today I installed the new pve-cluster 6.0-7; I hope this fixes some problems.
EDIT:
Got these right now:
Code:
Sep 4 10:23:29 server1 corosync[2290116]: [KNET ] link: host: 6 link: 0 is down
Sep 4 10:23:29 server1 corosync[2290116]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Sep 4 10:23:29 server1 corosync[2290116]: [KNET ] host: host: 6 has no active links
Sep 4 10:23:31 server1 corosync[2290116]: [KNET ] rx: host: 6 link: 0 is up
Sep 4 10:25:03 server1 corosync[2290116]: [KNET ] link: host: 8 link: 0 is down
Sep 4 10:25:03 server1 corosync[2290116]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Sep 4 10:25:03 server1 corosync[2290116]: [KNET ] host: host: 8 has no active links
Sep 4 10:25:05 server1 corosync[2290116]: [KNET ] rx: host: 8 link: 0 is up
Sep 4 10:25:05 server1 corosync[2290116]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Both host 6 and host 8 are Ceph nodes, but luckily Ceph has not crashed at all to my knowledge, and I have not lost any data (except for the servers going down when the cluster fails, at least once or twice a week). Minor cluster problems every day.
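When the links flap like this I now check corosync's own view of them from a couple of nodes, e.g.:
Code:
corosync-cfgtool -s    # knet link status per host
pvecm status           # quorum / membership as PVE sees it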
Hehe, right. Guess I was a bit quick with the zpool upgrade command. Just a test machine on 4.15 so far, with nothing in the pool, but yes, the machine came up and the zpool import failed as expected.
Code:
root@osl108pve:~# /sbin/zpool import -aN -d /dev/disk/by-id -o cachefile=none
This pool uses the following feature(s) not supported by this system:
org.zfsonlinux:project_quota (space/object accounting based on project ID.)
com.delphix:spacemap_v2 (Space maps representing large segments are more efficient.)
All unsupported features are only required for writing to the pool.
The pool can be imported using '-o readonly=on'.
cannot import 'zfs-mirror-disk0': unsupported version or feature
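If I needed the data before booting a kernel with the newer ZFS features, a read-only import as the message suggests should work, something like:
Code:
/sbin/zpool import -aN -d /dev/disk/by-id -o cachefile=none -o readonly=on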
I have a problem that seems related. I updated libknet on all my hosts (a mix of 6.0 and 5.4 with upgraded corosync), but I still have problems with my cluster. The cluster is lost each night. I found that if I disable the scheduled vzdump of my VMs, the cluster survives.
I have a single gigabit LAN, so it can be somewhat loaded at backup time, but not more than before.
Also, I haven't checked since I upgraded libknet, but before that I found corosync was consuming a good amount of RAM when the cluster was broken. Everything goes back to normal once I restart corosync on all hosts.
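In the meantime I'm thinking of capping the backup bandwidth so corosync isn't starved on the shared gigabit link; if I read the docs right that would be a bwlimit entry in /etc/vzdump.conf (value in KiB/s, number picked arbitrarily):
Code:
# /etc/vzdump.conf
# ~50 MiB/s, leaves headroom on the 1 Gbit link
bwlimit: 51200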
I know we are waiting to upgrade to PVE 6 until we see a resolution on this, as I haven't seen anyone state that a certain combination of hardware or configuration has been determined to cause this.
My understanding of this thread is that there are/were 3 issues:
- Corosync issues resolved with the updated libknet
- Corosync crashing (segmentation fault)
- Intermittent pauses, perhaps specific to bnx2
We manage 7 PVE 6 clusters, which are now all stable except for one that uses bnx2 network interfaces. Nothing is logged indicating an issue with the bnx2 network cards, unlike what someone else here was experiencing.
Our problematic cluster was perfect on PVE 4 and 5 and started having problems the moment we upgraded to 6.