Hi folks,
on one specific Thomas Krenn server (part of a 3-node cluster with Ceph) we see strange errors. The board is:
Manufacturer: Supermicro
Product Name: H11DSi-NT
Version: 2.00
with dual AMD EPYC 7301 16-Core Processor.
Corosync detects a link failure and, shortly after that, an SSD (ata7, a CT2000MX500SSD1 in the log below) reports ATA bus errors. Has anyone seen errors like this in that combination? We have replaced the SSDs several times, with no change. Any help is greatly appreciated.
Linux DHPLPX06 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
proxmox-ve 7.2-1
Code:
Nov 30 17:00:23 DHPLPX06 kernel: [ 2.224993] ata7.00: supports DRM functions and may not be fully accessible
Nov 30 17:00:23 DHPLPX06 kernel: [ 2.224996] ata7.00: ATA-10: CT2000MX500SSD1, M3CR023, max UDMA/133
Nov 30 17:00:23 DHPLPX06 kernel: [ 2.225026] ata7.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Nov 30 17:00:23 DHPLPX06 kernel: [ 2.225767] ata7.00: Features: Trust Dev-Sleep
Nov 30 17:00:23 DHPLPX06 kernel: [ 2.225880] ata7.00: supports DRM functions and may not be fully accessible
Nov 30 17:00:23 DHPLPX06 kernel: [ 2.226633] ata7.00: configured for UDMA/133
Dec 1 09:19:53 DHPLPX06 corosync[4342]: [KNET ] link: host: 1 link: 1 is down
Dec 1 09:19:53 DHPLPX06 corosync[4342]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 1 09:19:55 DHPLPX06 corosync[4342]: [KNET ] rx: host: 1 link: 1 is up
Dec 1 09:19:55 DHPLPX06 corosync[4342]: [KNET ] link: Resetting MTU for link 1 because host 1 joined
Dec 1 09:19:55 DHPLPX06 corosync[4342]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 1 09:19:55 DHPLPX06 corosync[4342]: [KNET ] pmtud: Global data MTU changed to: 1397
Dec 1 09:20:12 DHPLPX06 pmxcfs[4153]: [status] notice: received log
Dec 1 09:21:01 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d982
Dec 1 09:21:01 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d982
Dec 1 09:21:01 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d982
Dec 1 09:21:01 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d982
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:10 DHPLPX06 corosync[4342]: [TOTEM ] Retransmit List: 5d9bb
Dec 1 09:21:47 DHPLPX06 corosync[4342]: [KNET ] link: host: 3 link: 1 is down
Dec 1 09:21:47 DHPLPX06 corosync[4342]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Dec 1 09:21:49 DHPLPX06 corosync[4342]: [KNET ] rx: host: 3 link: 1 is up
Dec 1 09:21:49 DHPLPX06 corosync[4342]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
Dec 1 09:21:49 DHPLPX06 corosync[4342]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Dec 1 09:21:49 DHPLPX06 corosync[4342]: [KNET ] pmtud: Global data MTU changed to: 1397
Dec 1 09:22:05 DHPLPX06 corosync[4342]: [KNET ] link: host: 1 link: 1 is down
Dec 1 09:22:05 DHPLPX06 corosync[4342]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 1 09:22:07 DHPLPX06 corosync[4342]: [KNET ] rx: host: 1 link: 1 is up
Dec 1 09:22:07 DHPLPX06 corosync[4342]: [KNET ] link: Resetting MTU for link 1 because host 1 joined
Dec 1 09:22:07 DHPLPX06 corosync[4342]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 1 09:22:07 DHPLPX06 corosync[4342]: [KNET ] pmtud: Global data MTU changed to: 1397
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014685] ata7.00: exception Emask 0x10 SAct 0x3e000 SErr 0x4c0000 action 0x6 frozen
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014751] ata7.00: irq_stat 0x08000000, interface fatal error
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014780] ata7: SError: { CommWake 10B8B Handshk }
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014806] ata7.00: failed command: WRITE FPDMA QUEUED
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014832] ata7.00: cmd 61/80:68:a0:94:9f/05:00:bb:00:00/40 tag 13 ncq dma 720896 out
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014832] res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014904] ata7.00: status: { DRDY }
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014924] ata7.00: failed command: WRITE FPDMA QUEUED
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014949] ata7.00: cmd 61/80:70:20:9a:9f/07:00:bb:00:00/40 tag 14 ncq dma 983040 out
Dec 1 09:22:15 DHPLPX06 kernel: [58929.014949] res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec 1 09:22:15 DHPLPX06 kernel: [58929.015020] ata7.00: status: { DRDY }
Dec 1 09:22:15 DHPLPX06 kernel: [58929.015040] ata7.00: failed command: WRITE FPDMA QUEUED
Dec 1 09:22:15 DHPLPX06 kernel: [58929.015070] ata7.00: cmd 61/80:78:a0:a1:9f/06:00:bb:00:00/40 tag 15 ncq dma 851968 out
Dec 1 09:22:15 DHPLPX06 kernel: [58929.015070] res 40/00:8c:20:ad:9f/00:00:bb:00:00/40 Emask 0x10 (ATA bus error)
Dec 1 09:22:15 DHPLPX06 kernel: [58929.015155] ata7.00: status: { DRDY }
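
To check whether the timing between the corosync link drops and the ATA errors is more than coincidence, the syslog can be scanned for both kinds of event. The following is only a minimal sketch: the function names `parse_events` and `seconds_from_last_link_down` are made up for illustration, the match strings are taken from the excerpt above, and the year 2022 is an assumption because syslog lines carry no year.

```python
import re
from datetime import datetime

# Classic syslog prefix: "Dec 1 09:22:15 HOST rest-of-line"
LINE_RE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def parse_events(lines, year=2022):
    """Return (timestamp, kind) tuples for KNET link-down and ATA exception lines.

    The year is an assumption; it is not present in the log lines themselves.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        mon, day, clock, _host, rest = m.groups()
        ts = datetime.strptime(f"{year} {mon} {day} {clock}", "%Y %b %d %H:%M:%S")
        if "[KNET" in rest and "is down" in rest:
            events.append((ts, "link_down"))
        elif "exception Emask" in rest:
            events.append((ts, "ata_exception"))
    return events


def seconds_from_last_link_down(events):
    """For each ATA exception, seconds elapsed since the most recent link-down."""
    gaps = []
    last_down = None
    for ts, kind in events:
        if kind == "link_down":
            last_down = ts
        elif kind == "ata_exception" and last_down is not None:
            gaps.append(int((ts - last_down).total_seconds()))
    return gaps


if __name__ == "__main__":
    # Two lines copied from the log above; a real run would read the syslog file.
    sample = [
        "Dec 1 09:22:05 DHPLPX06 corosync[4342]: [KNET ] link: host: 1 link: 1 is down",
        "Dec 1 09:22:15 DHPLPX06 kernel: [58929.014685] ata7.00: exception Emask 0x10 "
        "SAct 0x3e000 SErr 0x4c0000 action 0x6 frozen",
    ]
    print(seconds_from_last_link_down(parse_events(sample)))
```

If the gaps are consistently small over many days of logs, that would point toward a common cause on this node rather than at the SSDs themselves; a handful of matches alone proves nothing.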