KNET/corosync Link down and instantly SSDs fail - interface fatal error

sigmarb

Hi folks,

on a specific Thomas Krenn server (part of a 3-node cluster with Ceph)

Manufacturer: Supermicro
Product Name: H11DSi-NT
Version: 2.00

with dual AMD EPYC 7301 16-Core Processors

we see strange errors: corosync detects a link failure and, right after that, an SSD reports errors. Has anyone ever seen errors like this in that combination? Any help is greatly appreciated. We have replaced the SSDs several times, with no change.

Linux DHPLPX06 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
proxmox-ve 7.2-1
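
To check whether the SSD errors really follow the corosync link flaps across the whole log, and not just in one excerpt, the two event types can be paired by timestamp. This is only a minimal sketch, assuming classic syslog timestamps without a year (as in the log in this post) and an English locale for month names. The function names `parse_events` and `errors_after_link_down`, the 60-second window, and the `year` default are my own assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Syslog prefix such as "Dec  1 09:22:15 DHPLPX06 kernel: ..."
LINE_RE = re.compile(r"^(\w{3})\s+(\d+)\s+(\d\d:\d\d:\d\d)\s+\S+\s+(.*)$")

def parse_events(lines, year=2022):
    """Return (link_down_times, ata_error_times) found in syslog lines.

    The year is not part of the syslog timestamp, so it is passed in (assumption).
    """
    link_down, ata_errors = [], []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        month, day, clock, rest = m.groups()
        # %b depends on the locale; English month abbreviations are assumed.
        ts = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        if "[KNET" in rest and re.search(r"link: \d+ is down", rest):
            link_down.append(ts)
        elif re.search(r"ata\d+(\.\d+)?: exception Emask", rest):
            ata_errors.append(ts)
    return link_down, ata_errors

def errors_after_link_down(link_down, ata_errors, window=timedelta(seconds=60)):
    """Pair each ATA exception with the latest link-down event inside the window before it."""
    pairs = []
    for err in ata_errors:
        preceding = [d for d in link_down if timedelta(0) <= err - d <= window]
        if preceding:
            pairs.append((max(preceding), err))
    return pairs

# Three lines taken from the log in this post.
sample = [
    "Dec  1 09:22:05 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 1 link: 1 is down",
    "Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 1 link: 1 is up",
    "Dec  1 09:22:15 DHPLPX06 kernel: [58929.014685] ata7.00: exception Emask 0x10 SAct 0x3e000 SErr 0x4c0000 action 0x6 frozen",
]

downs, errs = parse_events(sample)
for down, err in errors_after_link_down(downs, errs):
    print(f"link down {down:%H:%M:%S} -> ATA error {err:%H:%M:%S} (+{(err - down).seconds}s)")
```

Feeding it a full syslog file (for example `open("/var/log/syslog")`) would show how many ATA exceptions have a link-down event shortly before them and how many do not. A time correlation found this way does not establish a cause by itself.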


Code:
Nov 30 17:00:23 DHPLPX06 kernel: [    2.224993] ata7.00: supports DRM functions and may not be fully accessible
Nov 30 17:00:23 DHPLPX06 kernel: [    2.224996] ata7.00: ATA-10: CT2000MX500SSD1, M3CR023, max UDMA/133
Nov 30 17:00:23 DHPLPX06 kernel: [    2.225026] ata7.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Nov 30 17:00:23 DHPLPX06 kernel: [    2.225767] ata7.00: Features: Trust Dev-Sleep
Nov 30 17:00:23 DHPLPX06 kernel: [    2.225880] ata7.00: supports DRM functions and may not be fully accessible
Nov 30 17:00:23 DHPLPX06 kernel: [    2.226633] ata7.00: configured for UDMA/133


Dec  1 09:19:53 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 1 link: 1 is down
Dec  1 09:19:53 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 1 link: 1 is up
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec  1 09:20:12 DHPLPX06 pmxcfs[4153]: [status] notice: received log
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:47 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 3 link: 1 is down
Dec  1 09:21:47 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 3 link: 1 is up
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec  1 09:22:05 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 1 link: 1 is down
Dec  1 09:22:05 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 1 link: 1 is up
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014685] ata7.00: exception Emask 0x10 SAct 0x3e000 SErr 0x4c0000 action 0x6 frozen
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014751] ata7.00: irq_stat 0x08000000, interface fatal error
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014780] ata7: SError: { CommWake 10B8B Handshk }
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014806] ata7.00: failed command: WRITE FPDMA QUEUED
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014832] ata7.00: cmd 61/80:68:a0:94:9f/05:00:bb:00:00/40 tag 13 ncq dma 720896 out
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014832]          res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014904] ata7.00: status: { DRDY }
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014924] ata7.00: failed command: WRITE FPDMA QUEUED
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014949] ata7.00: cmd 61/80:70:20:9a:9f/07:00:bb:00:00/40 tag 14 ncq dma 983040 out
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014949]          res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015020] ata7.00: status: { DRDY }
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015040] ata7.00: failed command: WRITE FPDMA QUEUED
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015070] ata7.00: cmd 61/80:78:a0:a1:9f/06:00:bb:00:00/40 tag 15 ncq dma 851968 out
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015070]          res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015155] ata7.00: status: { DRDY }
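
For anyone reading along, the `SErr 0x4c0000` value in the exception line can be decoded by hand and compared with the `{ CommWake 10B8B Handshk }` line the kernel prints. A small sketch follows. The bit positions are my reading of the diagnostic half (bits 16 to 26) of the SATA SError register, and the names follow the labels the kernel prints, so treat the table as an assumption and verify it against the libata source before relying on it. `decode_serr` is a hypothetical helper name.

```python
# Diagnostic bits (16..26) of the SATA SError register, labelled the way
# the kernel prints them. Assumed mapping, not copied from the kernel source.
SERR_DIAG_BITS = {
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}

def decode_serr(value):
    """Return the names of the diagnostic bits set in an SError value."""
    return [name for bit, name in sorted(SERR_DIAG_BITS.items()) if value & (1 << bit)]

print(decode_serr(0x4C0000))
```

For `0x4c0000` this yields the same three flags as the kernel's own decode above. All three describe the SATA link (wake signalling, 10b-to-8b decoding, handshake) and not the flash media, which would at least be consistent with replacing the SSDs making no difference.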
 
Thank you for your time. Unfortunately not. We have been seeing these issues for more than 2 years on this specific server. I just installed the latest 5.19 kernel and will report back.

Update 05/11/22: Still the same errors with 5.19.17-1-pve.