KNET/corosync Link down and instantly SSDs fail - interface fatal error

sigmarb

Hi folks,

on one specific Thomas Krenn server (part of a 3-node cluster with Ceph) we see strange errors.

Manufacturer: Supermicro
Product Name: H11DSi-NT
Version: 2.00

with dual AMD EPYC 7301 16-Core Processor

corosync detects a link failure, and shortly afterwards an SSD reports errors. Has anyone seen errors like this in that combination? Any help is greatly appreciated. We have replaced the SSDs several times, with no change.

Linux DHPLPX06 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
proxmox-ve 7.2-1


Code:
Nov 30 17:00:23 DHPLPX06 kernel: [    2.224993] ata7.00: supports DRM functions and may not be fully accessible
Nov 30 17:00:23 DHPLPX06 kernel: [    2.224996] ata7.00: ATA-10: CT2000MX500SSD1, M3CR023, max UDMA/133
Nov 30 17:00:23 DHPLPX06 kernel: [    2.225026] ata7.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Nov 30 17:00:23 DHPLPX06 kernel: [    2.225767] ata7.00: Features: Trust Dev-Sleep
Nov 30 17:00:23 DHPLPX06 kernel: [    2.225880] ata7.00: supports DRM functions and may not be fully accessible
Nov 30 17:00:23 DHPLPX06 kernel: [    2.226633] ata7.00: configured for UDMA/133


Dec  1 09:19:53 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 1 link: 1 is down
Dec  1 09:19:53 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 1 link: 1 is up
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:19:55 DHPLPX06 corosync[4342]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec  1 09:20:12 DHPLPX06 pmxcfs[4153]: [status] notice: received log
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:01 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d982
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:10 DHPLPX06 corosync[4342]:   [TOTEM ] Retransmit List: 5d9bb
Dec  1 09:21:47 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 3 link: 1 is down
Dec  1 09:21:47 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 3 link: 1 is up
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Dec  1 09:21:49 DHPLPX06 corosync[4342]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec  1 09:22:05 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 1 link: 1 is down
Dec  1 09:22:05 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 1 link: 1 is up
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014685] ata7.00: exception Emask 0x10 SAct 0x3e000 SErr 0x4c0000 action 0x6 frozen
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014751] ata7.00: irq_stat 0x08000000, interface fatal error
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014780] ata7: SError: { CommWake 10B8B Handshk }
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014806] ata7.00: failed command: WRITE FPDMA QUEUED
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014832] ata7.00: cmd 61/80:68:a0:94:9f/05:00:bb:00:00/40 tag 13 ncq dma 720896 out
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014832]          res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014904] ata7.00: status: { DRDY }
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014924] ata7.00: failed command: WRITE FPDMA QUEUED
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014949] ata7.00: cmd 61/80:70:20:9a:9f/07:00:bb:00:00/40 tag 14 ncq dma 983040 out
Dec  1 09:22:15 DHPLPX06 kernel: [58929.014949]          res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015020] ata7.00: status: { DRDY }
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015040] ata7.00: failed command: WRITE FPDMA QUEUED
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015070] ata7.00: cmd 61/80:78:a0:a1:9f/06:00:bb:00:00/40 tag 15 ncq dma 851968 out
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015070]          res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec  1 09:22:15 DHPLPX06 kernel: [58929.015155] ata7.00: status: { DRDY }
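For reading the SErr value in the exception line, here is a minimal sketch that decodes the diagnostic bits of the SATA SError register. The bit-to-name table follows the abbreviations the Linux libata error report prints; the function name `decode_serror` is just made up for this example. For 0x4c0000 it yields the same three flags the kernel shows (CommWake, 10B8B, Handshk). These are link-level flags, which may point toward the SATA link (cable, backplane, port, power) rather than the drive's flash, but that is only a hint, not a diagnosis.

```python
# Diagnostic bits (16-26) of the SATA SError register, using the short
# names that the Linux libata error report prints inside "SError: { ... }".
SERR_DIAG_BITS = {
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value: int) -> list[str]:
    """Return the names of all diagnostic bits set in an SErr value."""
    return [
        name
        for bit, name in sorted(SERR_DIAG_BITS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # SErr value taken from the ata7.00 exception line in the log above.
    print(decode_serror(0x4C0000))
```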
 
Thank you for your time. Unfortunately not. We have been seeing these issues on this specific server for more than 2 years. I have just installed the latest 5.19 kernel and will report back.

Update 05/11/22: The same errors still occur with 5.19.17-1-pve.
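To check whether the ata exceptions really follow a KNET link-down each time, a rough sketch like the following could be run over the syslog. It assumes the traditional syslog line format shown in the log excerpt, assumes the year (syslog timestamps carry none; 2022 is taken from the kernel build date), and uses an arbitrary 5 minute window. The helper names are made up for this example.

```python
import re
from datetime import datetime, timedelta

# Leading syslog timestamp, e.g. "Dec  1 09:22:15"
TS_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")


def parse_ts(line, year=2022):
    """Parse the syslog timestamp of a line; the year is an assumption."""
    m = TS_RE.match(line)
    if not m:
        return None
    stamp = " ".join(m.group(1).split())  # collapse the double space
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def ata_errors_after_link_down(lines, window=timedelta(minutes=5), year=2022):
    """Return (timestamp, delay) for each ata exception that occurs
    within `window` after the most recent KNET link-down message."""
    last_down = None
    hits = []
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        if "[KNET" in line and "is down" in line:
            last_down = ts
        elif "kernel:" in line and "exception Emask" in line:
            if last_down is not None and ts - last_down <= window:
                hits.append((ts, ts - last_down))
    return hits


# Three lines copied from the log excerpt in this thread.
SAMPLE = [
    "Dec  1 09:22:05 DHPLPX06 corosync[4342]:   [KNET  ] link: host: 1 link: 1 is down",
    "Dec  1 09:22:07 DHPLPX06 corosync[4342]:   [KNET  ] rx: host: 1 link: 1 is up",
    "Dec  1 09:22:15 DHPLPX06 kernel: [58929.014685] ata7.00: exception Emask 0x10 SAct 0x3e000 SErr 0x4c0000 action 0x6 frozen",
]

if __name__ == "__main__":
    for ts, delay in ata_errors_after_link_down(SAMPLE):
        print(f"{ts} ata exception {delay.total_seconds():.0f}s after link down")
```

On a full log, the same function can be fed the lines of the syslog file; if every ata exception shows up with a short delay after a link-down, that would support the suspicion that the two are related.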
 