Thanks, that's all I needed to know. Keep up the good work.
No, I'm not a developer. And please read the details on the links I posted.
Yes, and please read the details on the links I posted.
@ahovda I did that (almost the same, see the thread above), but it did not help my cluster. Please report back after a week on whether it helps you.
Seem related to
You're right, it did not really help; the whole cluster crashed and I'm in late to recover it. I think I might have some problems with the bnx2 driver, since the Ethernet ports are going down and up.
I've searched and have added intremap=off to the kernel cmdline, and after rebooting the nodes I'll see how it goes. I'm considering adding the bnx2 parameter disable_msi=1 as well.
These are the adapters:
Code:
02:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Limited NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
(I know - I should have separate cluster fabric and 10Gb when I'm also playing with Ceph, but it used to be rock solid with corosync2)
Next step would be to rule out any issues with switches and cables.
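Before drawing conclusions from a reboot, it can help to confirm that the parameter actually reached the running kernel. The following is a minimal sketch, assuming a Linux host where /proc/cmdline is readable; the helper names has_kernel_param and running_cmdline are made up for illustration and are not part of any Proxmox tooling. It only checks the command line and does not change any setting.

```python
from pathlib import Path
from typing import Optional


def has_kernel_param(cmdline: str, name: str, value: Optional[str] = None) -> bool:
    """Return True if `name` (optionally as `name=value`) is a token in `cmdline`."""
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == name and (value is None or val == value):
            return True
    return False


def running_cmdline() -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path("/proc/cmdline").read_text()


# Example with a made-up command line; on a real node pass running_cmdline() instead.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/mapper/pve-root ro quiet intremap=off"
print("intremap=off present:", has_kernel_param(sample, "intremap", "off"))
```

On a node, has_kernel_param(running_cmdline(), "intremap", "off") should return True only after the change has been written to the bootloader configuration and the node has been rebooted.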
[ 820.947995] bnx2 0000:02:00.1 eno2: <--- start FTQ dump --->
[ 820.948477] bnx2 0000:02:00.1 eno2: RV2P_PFTQ_CTL 00010002
[ 820.948743] bnx2 0000:02:00.1 eno2: RV2P_TFTQ_CTL 00020000
[ 820.948997] bnx2 0000:02:00.1 eno2: RV2P_MFTQ_CTL 00004000
[ 820.949251] bnx2 0000:02:00.1 eno2: TBDR_FTQ_CTL 00004002
[ 820.949507] bnx2 0000:02:00.1 eno2: TDMA_FTQ_CTL 00010002
[ 820.949770] bnx2 0000:02:00.1 eno2: TXP_FTQ_CTL 00010002
[ 820.950031] bnx2 0000:02:00.1 eno2: TXP_FTQ_CTL 00010002
[ 820.950288] bnx2 0000:02:00.1 eno2: TPAT_FTQ_CTL 00010002
[ 820.950547] bnx2 0000:02:00.1 eno2: RXP_CFTQ_CTL 00008000
[ 820.950806] bnx2 0000:02:00.1 eno2: RXP_FTQ_CTL 00100000
[ 820.951066] bnx2 0000:02:00.1 eno2: COM_COMXQ_FTQ_CTL 00010000
[ 820.951329] bnx2 0000:02:00.1 eno2: COM_COMTQ_FTQ_CTL 00020000
[ 820.951592] bnx2 0000:02:00.1 eno2: COM_COMQ_FTQ_CTL 00010000
[ 820.951857] bnx2 0000:02:00.1 eno2: CP_CPQ_FTQ_CTL 00004000
[ 820.952255] bnx2 0000:02:00.1 eno2: CPU states:
[ 820.952567] bnx2 0000:02:00.1 eno2: 045000 mode b84c state 80005000 evt_mask 500 pc 8001294 pc 800128c instr 8e260000
[ 820.952884] bnx2 0000:02:00.1 eno2: 085000 mode b84c state 80001000 evt_mask 500 pc 8000a4c pc 8000a4c instr 38420001
[ 820.953196] bnx2 0000:02:00.1 eno2: 0c5000 mode b84c state 80001000 evt_mask 500 pc 8004c14 pc 8004c14 instr 32050003
[ 820.953511] bnx2 0000:02:00.1 eno2: 105000 mode b8cc state 80004000 evt_mask 500 pc 8000a9c pc 8000a9c instr 32620007
[ 820.953827] bnx2 0000:02:00.1 eno2: 145000 mode b880 state 80000000 evt_mask 500 pc 800b5ec pc 8000104 instr a0b821
[ 820.954147] bnx2 0000:02:00.1 eno2: 185000 mode b8cc state 80000000 evt_mask 500 pc 8000c6c pc 8000c74 instr 3c058000
[ 820.954460] bnx2 0000:02:00.1 eno2: <--- end FTQ dump --->
[ 820.954775] bnx2 0000:02:00.1 eno2: <--- start TBDC dump --->
[ 820.955094] bnx2 0000:02:00.1 eno2: TBDC free cnt: 32
[ 820.955414] bnx2 0000:02:00.1 eno2: LINE CID BIDX CMD VALIDS
[ 820.955747] bnx2 0000:02:00.1 eno2: 00 001300 6a00 00 [0]
[ 820.956198] bnx2 0000:02:00.1 eno2: 01 001200 a4f8 00 [0]
[ 820.956556] bnx2 0000:02:00.1 eno2: 02 001300 6a00 00 [0]
[ 820.956886] bnx2 0000:02:00.1 eno2: 03 001100 d308 00 [0]
[ 820.957201] bnx2 0000:02:00.1 eno2: 04 001280 3680 00 [0]
[ 820.957510] bnx2 0000:02:00.1 eno2: 05 001100 ce70 00 [0]
[ 820.957814] bnx2 0000:02:00.1 eno2: 06 001100 cb48 00 [0]
[ 820.958113] bnx2 0000:02:00.1 eno2: 07 001100 cb50 00 [0]
[ 820.958408] bnx2 0000:02:00.1 eno2: 08 001300 63a0 00 [0]
[ 820.958695] bnx2 0000:02:00.1 eno2: 09 001080 2760 00 [0]
[ 820.958975] bnx2 0000:02:00.1 eno2: 0a 000800 6990 00 [0]
[ 820.959250] bnx2 0000:02:00.1 eno2: 0b 001300 6368 00 [0]
[ 820.959507] bnx2 0000:02:00.1 eno2: 0c 001200 9c88 00 [0]
[ 820.959757] bnx2 0000:02:00.1 eno2: 0d 001200 9c90 00 [0]
[ 820.959999] bnx2 0000:02:00.1 eno2: 0e 001180 4618 00 [0]
[ 820.960431] bnx2 0000:02:00.1 eno2: 0f 001100 80c8 00 [0]
[ 820.960696] bnx2 0000:02:00.1 eno2: 10 001300 1810 00 [0]
[ 820.960957] bnx2 0000:02:00.1 eno2: 11 001200 1978 00 [0]
[ 820.961220] bnx2 0000:02:00.1 eno2: 12 001100 80d8 00 [0]
[ 820.961477] bnx2 0000:02:00.1 eno2: 13 001000 65b0 00 [0]
[ 820.961730] bnx2 0000:02:00.1 eno2: 14 001100 80d0 00 [0]
[ 820.961978] bnx2 0000:02:00.1 eno2: 15 001100 80e0 00 [0]
[ 820.962219] bnx2 0000:02:00.1 eno2: 16 001280 e5f8 00 [0]
[ 820.962458] bnx2 0000:02:00.1 eno2: 17 001280 e600 00 [0]
[ 820.962696] bnx2 0000:02:00.1 eno2: 18 001200 1948 00 [0]
[ 820.962936] bnx2 0000:02:00.1 eno2: 19 001200 1950 00 [0]
[ 820.963172] bnx2 0000:02:00.1 eno2: 1a 001280 e5d0 00 [0]
[ 820.963408] bnx2 0000:02:00.1 eno2: 1b 001280 e5d8 00 [0]
[ 820.963642] bnx2 0000:02:00.1 eno2: 1c 001280 e5e0 00 [0]
[ 820.963876] bnx2 0000:02:00.1 eno2: 1d 001280 e5e8 00 [0]
[ 820.964122] bnx2 0000:02:00.1 eno2: 1e 001180 f7c0 00 [0]
[ 820.964355] bnx2 0000:02:00.1 eno2: 1f 001180 f7c8 00 [0]
[ 820.964562] bnx2 0000:02:00.1 eno2: <--- end TBDC dump --->
[ 820.964771] bnx2 0000:02:00.1 eno2: DEBUG: intr_sem[0] PCI_CMD[00100406]
[ 820.964985] bnx2 0000:02:00.1 eno2: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
[ 820.965200] bnx2 0000:02:00.1 eno2: DEBUG: EMAC_TX_STATUS[0000000e] EMAC_RX_STATUS[00000000]
[ 820.965420] bnx2 0000:02:00.1 eno2: DEBUG: RPM_MGMT_PKT_CTRL[40000088]
[ 820.965640] bnx2 0000:02:00.1 eno2: DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
[ 820.965854] bnx2 0000:02:00.1 eno2: DEBUG: PBA[00000000]
[ 820.966065] bnx2 0000:02:00.1 eno2: <--- start MCP states dump --->
[ 820.966284] bnx2 0000:02:00.1 eno2: DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
[ 820.966529] bnx2 0000:02:00.1 eno2: DEBUG: MCP mode[0000b880] state[80008000] evt_mask[00000500]
[ 820.966769] bnx2 0000:02:00.1 eno2: DEBUG: pc[080044b8] pc[080009f4] instr[30420003]
[ 820.967002] bnx2 0000:02:00.1 eno2: DEBUG: shmem states:
[ 820.967228] bnx2 0000:02:00.1 eno2: DEBUG: drv_mb[01030003] fw_mb[00000003] link_status[0000006f]
[ 820.967463] drv_pulse_mb[000002fa]
[ 820.967467] bnx2 0000:02:00.1 eno2: DEBUG: dev_info_signature[44564907] reset_type[01005254]
[ 820.967705] condition[0003610e]
[ 820.967712] bnx2 0000:02:00.1 eno2: DEBUG: 000001c0: 01005254 42530000 0003610e 00000000
[ 820.967958] bnx2 0000:02:00.1 eno2: DEBUG: 000003cc: 00000000 00000000 00000000 00000000
[ 820.968354] bnx2 0000:02:00.1 eno2: DEBUG: 000003dc: 00000000 00000000 00000000 00000000
[ 820.968632] bnx2 0000:02:00.1 eno2: DEBUG: 000003ec: 00000000 00000000 00000000 00000000
[ 820.968887] bnx2 0000:02:00.1 eno2: DEBUG: 0x3fc[00000000]
[ 820.969134] bnx2 0000:02:00.1 eno2: <--- end MCP states dump --->
[ 821.001111] CE: hpet increased min_delta_ns to 67887 nsec
[ 821.081255] bnx2 0000:02:00.1 eno2: NIC Copper Link is Down
[ 824.312911] bnx2 0000:02:00.1 eno2: NIC Copper Link is Up, 1000 Mbps full duplex
Maybe you can try to reinstall kernel 4.15 from Proxmox 5 as a workaround for now.
Good idea. Is that going to work, or does some part of PVE 6 depend on the new kernel? I'll give it a go and find out, I guess.
I have no Ethernet problems in my logs, so I don't think that is the main problem, but you should fix that anyway.
I think only the ZFS version could be impacted.
So if you updated the pools to activate TRIM, you can't boot with the old kernel?
Hehe, right. Guess I was a bit quick on the zpool upgrade command. It's just a test machine on 4.15 so far, with nothing in the pool, but yes: the machine came up, but zpool-import failed as expected.
root@osl108pve:~# /sbin/zpool import -aN -d /dev/disk/by-id -o cachefile=none
This pool uses the following feature(s) not supported by this system:
org.zfsonlinux:project_quota (space/object accounting based on project ID.)
com.delphix:spacemap_v2 (Space maps representing large segments are more efficient.)
All unsupported features are only required for writing to the pool.
The pool can be imported using '-o readonly=on'.
cannot import 'zfs-mirror-disk0': unsupported version or feature
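When several pools or nodes are involved, it can be handy to extract the list of unsupported features from that error text instead of reading it by eye. Below is a small sketch that parses output shaped like the message above; the function name unsupported_features and the returned tuple layout are assumptions for illustration, and the exact wording may differ between ZFS versions.

```python
import re

# A feature line looks like: "com.delphix:spacemap_v2 (Space maps ... efficient.)"
FEATURE_RE = re.compile(r"^\s*(?P<name>[a-z0-9.\-]+:[a-z0-9_]+)\s+\((?P<desc>.*)\)\s*$")


def unsupported_features(output):
    """Return (feature names, readonly_possible) parsed from `zpool import` error text."""
    names = []
    for line in output.splitlines():
        m = FEATURE_RE.match(line)
        if m:
            names.append(m["name"])
    # The tool states explicitly when the features only matter for writing.
    readonly_possible = "only required for writing" in output
    return names, readonly_possible


sample = """\
This pool uses the following feature(s) not supported by this system:
org.zfsonlinux:project_quota (space/object accounting based on project ID.)
com.delphix:spacemap_v2 (Space maps representing large segments are more efficient.)
All unsupported features are only required for writing to the pool.
The pool can be imported using '-o readonly=on'.
cannot import 'zfs-mirror-disk0': unsupported version or feature
"""
names, readonly_possible = unsupported_features(sample)
print(names, "read-only import possible:", readonly_possible)
```

For the output above this reports the two features and indicates that a read-only import is still possible, which matches the hint printed by zpool itself.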
Hello,
I have a problem that seems related. I updated libknet on all my hosts (a mix of 6.0 and 5.4 with upgraded corosync), but I still have problems with my cluster. My cluster is lost each night. I found that if I disable the scheduled vzdump of my VMs, the cluster survives.
Are you sure that your network links are not overloaded?