Storage lost

Aleksej

Member
Feb 25, 2018
Hello.

From time to time my SSDs report errors and the storage becomes inaccessible.
A few lines from the log:

Apr 6 12:34:00 ps13 systemd[1]: Started Proxmox VE replication runner.
Apr 6 12:35:00 ps13 systemd[1]: Starting Proxmox VE replication runner...
Apr 6 12:35:00 ps13 systemd[1]: pvesr.service: Succeeded.
Apr 6 12:35:00 ps13 systemd[1]: Started Proxmox VE replication runner.
Apr 6 12:36:00 ps13 systemd[1]: Starting Proxmox VE replication runner...
Apr 6 12:36:00 ps13 systemd[1]: pvesr.service: Succeeded.
Apr 6 12:36:00 ps13 systemd[1]: Started Proxmox VE replication runner.
Apr 6 12:36:46 ps13 kernel: [362858.768260] ahci 0000:02:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xfca8e000 flags=0x0000]
Apr 6 12:36:47 ps13 kernel: [362859.050225] ata2.00: exception Emask 0x10 SAct 0x700063ff SErr 0x0 action 0x6 frozen
Apr 6 12:36:47 ps13 kernel: [362859.050227] ata2.00: irq_stat 0x08000000, interface fatal error
Apr 6 12:36:47 ps13 kernel: [362859.050230] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050232] ata2.00: cmd 61/08:00:58:c3:36/00:00:0a:00:00/40 tag 0 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050232] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050235] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050236] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050238] ata2.00: cmd 61/08:08:f8:2b:3f/00:00:0a:00:00/40 tag 1 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050238] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050241] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050242] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050245] ata2.00: cmd 61/08:10:60:2c:3f/00:00:0a:00:00/40 tag 2 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050245] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050247] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050248] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050251] ata2.00: cmd 61/10:18:c8:74:de/00:00:01:00:00/40 tag 3 ncq dma 8192 out
Apr 6 12:36:47 ps13 kernel: [362859.050251] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050253] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050254] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050257] ata2.00: cmd 61/08:20:08:c6:01/00:00:02:00:00/40 tag 4 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050257] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050259] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050260] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050263] ata2.00: cmd 61/08:28:18:c6:01/00:00:02:00:00/40 tag 5 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050263] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050265] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050267] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050269] ata2.00: cmd 61/30:30:c0:02:02/00:00:02:00:00/40 tag 6 ncq dma 24576 out
Apr 6 12:36:47 ps13 kernel: [362859.050269] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050271] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050273] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050275] ata2.00: cmd 61/08:38:00:39:4e/00:00:03:00:00/40 tag 7 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050275] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050277] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050279] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050281] ata2.00: cmd 61/08:40:a8:3d:4e/00:00:03:00:00/40 tag 8 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050281] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050283] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050285] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050287] ata2.00: cmd 61/10:48:58:44:4e/00:00:03:00:00/40 tag 9 ncq dma 8192 out
Apr 6 12:36:47 ps13 kernel: [362859.050287] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050289] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050291] ata2.00: failed command: READ FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050293] ata2.00: cmd 60/08:68:40:ae:43/00:00:73:00:00/40 tag 13 ncq dma 4096 in
Apr 6 12:36:47 ps13 kernel: [362859.050293] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050296] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050297] ata2.00: failed command: READ FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050299] ata2.00: cmd 60/00:70:00:00:00/01:00:00:00:00/40 tag 14 ncq dma 131072 in
Apr 6 12:36:47 ps13 kernel: [362859.050299] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050302] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050303] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050305] ata2.00: cmd 61/08:e0:90:c2:4e/00:00:08:00:00/40 tag 28 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050305] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050308] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050309] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050311] ata2.00: cmd 61/08:e8:c0:8e:4f/00:00:08:00:00/40 tag 29 ncq dma 4096 out
Apr 6 12:36:47 ps13 kernel: [362859.050311] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050314] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050315] ata2.00: failed command: WRITE FPDMA QUEUED
Apr 6 12:36:47 ps13 kernel: [362859.050317] ata2.00: cmd 61/10:f0:00:08:50/00:00:08:00:00/40 tag 30 ncq dma 8192 out
Apr 6 12:36:47 ps13 kernel: [362859.050317] res 40/00:48:58:44:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 6 12:36:47 ps13 kernel: [362859.050320] ata2.00: status: { DRDY }
Apr 6 12:36:47 ps13 kernel: [362859.050323] ata2: hard resetting link
Apr 6 12:36:57 ps13 kernel: [362869.051079] ata2: softreset failed (1st FIS failed)
Apr 6 12:36:57 ps13 kernel: [362869.051085] ata2: hard resetting link
Apr 6 12:37:00 ps13 systemd[1]: Starting Proxmox VE replication runner...
Apr 6 12:37:00 ps13 systemd[1]: pvesr.service: Succeeded.
Apr 6 12:37:00 ps13 systemd[1]: Started Proxmox VE replication runner.
Apr 6 12:37:07 ps13 kernel: [362879.050916] ata2: softreset failed (1st FIS failed)
Apr 6 12:37:07 ps13 kernel: [362879.050922] ata2: hard resetting link
Apr 6 12:37:17 ps13 kernel: [362889.002142] ata6.00: exception Emask 0x0 SAct 0x400000 SErr 0x0 action 0x6 frozen
Apr 6 12:37:17 ps13 kernel: [362889.002148] ata6.00: failed command: READ FPDMA QUEUED
Apr 6 12:37:17 ps13 kernel: [362889.002152] ata6.00: cmd 60/00:b0:00:00:00/01:00:00:00:00/40 tag 22 ncq dma 131072 in
Apr 6 12:37:17 ps13 kernel: [362889.002152] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 6 12:37:17 ps13 kernel: [362889.002155] ata6.00: status: { DRDY }
Apr 6 12:37:17 ps13 kernel: [362889.002158] ata6: hard resetting link
Apr 6 12:37:17 ps13 kernel: [362889.002169] ata5.00: exception Emask 0x0 SAct 0x400000 SErr 0x0 action 0x6 frozen
Apr 6 12:37:17 ps13 kernel: [362889.002173] ata5.00: failed command: READ FPDMA QUEUED
Apr 6 12:37:17 ps13 kernel: [362889.002176] ata5.00: cmd 60/00:b0:00:00:00/01:00:00:00:00/40 tag 22 ncq dma 131072 in
Apr 6 12:37:17 ps13 kernel: [362889.002176] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 6 12:37:17 ps13 kernel: [362889.002180] ata5.00: status: { DRDY }
Apr 6 12:37:17 ps13 kernel: [362889.002182] ata5: hard resetting link
Apr 6 12:37:17 ps13 kernel: [362889.002191] ata1.00: exception Emask 0x0 SAct 0x3000 SErr 0x0 action 0x6 frozen
Apr 6 12:37:17 ps13 kernel: [362889.002193] ata1.00: failed command: READ FPDMA QUEUED
Apr 6 12:37:17 ps13 kernel: [362889.002196] ata1.00: cmd 60/00:60:00:00:00/01:00:00:00:00/40 tag 12 ncq dma 131072 in
Apr 6 12:37:17 ps13 kernel: [362889.002196] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 6 12:37:17 ps13 kernel: [362889.002198] ata1.00: status: { DRDY }
Apr 6 12:37:17 ps13 kernel: [362889.002199] ata1.00: failed command: READ FPDMA QUEUED
Apr 6 12:37:17 ps13 kernel: [362889.002202] ata1.00: cmd 60/00:68:00:08:00/01:00:00:00:00/40 tag 13 ncq dma 131072 in
Apr 6 12:37:17 ps13 kernel: [362889.002202] res 40/00:ff:ff:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 6 12:37:17 ps13 kernel: [362889.002204] ata1.00: status: { DRDY }
Apr 6 12:37:17 ps13 kernel: [362889.002206] ata1: hard resetting link
Apr 6 12:37:27 ps13 kernel: [362899.002609] ata1: softreset failed (1st FIS failed)
Apr 6 12:37:27 ps13 kernel: [362899.002615] ata1: hard resetting link
Apr 6 12:37:27 ps13 kernel: [362899.002653] ata5: softreset failed (1st FIS failed)
Apr 6 12:37:27 ps13 kernel: [362899.002656] ata5: hard resetting link
Apr 6 12:37:27 ps13 kernel: [362899.002706] ata6: softreset failed (1st FIS failed)
Apr 6 12:37:27 ps13 kernel: [362899.002709] ata6: hard resetting link
Apr 6 12:37:37 ps13 kernel: [362909.002577] ata1: softreset failed (1st FIS failed)
Apr 6 12:37:37 ps13 kernel: [362909.002583] ata1: hard resetting link
Apr 6 12:37:37 ps13 kernel: [362909.002596] ata6: softreset failed (1st FIS failed)
Apr 6 12:37:37 ps13 kernel: [362909.002599] ata6: hard resetting link
Apr 6 12:37:37 ps13 kernel: [362909.002637] ata5: softreset failed (1st FIS failed)
Apr 6 12:37:37 ps13 kernel: [362909.002640] ata5: hard resetting link
Apr 6 12:37:42 ps13 kernel: [362914.050962] ata2: softreset failed (1st FIS failed)
Apr 6 12:37:42 ps13 kernel: [362914.050969] ata2: limiting SATA link speed to 3.0 Gbps
Apr 6 12:37:42 ps13 kernel: [362914.050970] ata2: hard resetting link
Apr 6 12:37:47 ps13 kernel: [362919.050857] ata2: softreset failed (1st FIS failed)
Apr 6 12:37:47 ps13 kernel: [362919.050864] ata2: reset failed, giving up
Apr 6 12:37:47 ps13 kernel: [362919.050866] ata2.00: disabled
Apr 6 12:37:47 ps13 kernel: [362919.050886] ata2: EH complete
Apr 6 12:37:47 ps13 kernel: [362919.050911] sd 1:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:37:47 ps13 kernel: [362919.050912] sd 1:0:0:0: [sdb] tag#15 CDB: Read(10) 28 00 02 a1 f6 30 00 00 08 00
Apr 6 12:37:47 ps13 kernel: [362919.050914] blk_update_request: I/O error, dev sdb, sector 44168752 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 6 12:37:47 ps13 kernel: [362919.050930] sd 1:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:37:47 ps13 kernel: [362919.050931] sd 1:0:0:0: [sdb] tag#18 CDB: Read(10) 28 00 01 de 47 98 00 00 08 00
Apr 6 12:37:47 ps13 kernel: [362919.050931] blk_update_request: I/O error, dev sdb, sector 31344536 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 6 12:37:47 ps13 kernel: [362919.050940] sd 1:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:37:47 ps13 kernel: [362919.050941] sd 1:0:0:0: [sdb] tag#19 CDB: Write(10) 2a 00 08 50 08 00 00 00 10 00
Apr 6 12:37:47 ps13 kernel: [362919.050942] blk_update_request: I/O error, dev sdb, sector 139462656 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Apr 6 12:37:47 ps13 kernel: [362919.050949] Buffer I/O error on dev dm-24, logical block 358784, lost async page write
Apr 6 12:37:47 ps13 kernel: [362919.050957] Buffer I/O error on dev dm-24, logical block 358785, lost async page write
Apr 6 12:37:47 ps13 kernel: [362919.050961] sd 1:0:0:0: [sdb] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:37:47 ps13 kernel: [362919.050962] sd 1:0:0:0: [sdb] tag#20 CDB: Write(10) 2a 00 08 4f 8e c0 00 00 08 00
Apr 6 12:37:47 ps13 kernel: [362919.050963] blk_update_request: I/O error, dev sdb, sector 139431616 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
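
A note on reading the exception lines in the log above: SAct is the SATA SActive register, a bitmask with one bit per outstanding NCQ tag. A minimal sketch for decoding it, where the helper name sact_tags is made up for illustration:

```shell
# sact_tags MASK: hypothetical helper, prints the NCQ tag numbers whose bits
# are set in a 32-bit SAct value such as 0x700063ff.
sact_tags() {
    mask=$(($1))
    tag=0
    out=""
    while [ "$tag" -lt 32 ]; do
        if [ $(( (mask >> tag) & 1 )) -eq 1 ]; then
            out="$out $tag"
        fi
        tag=$((tag + 1))
    done
    echo "${out# }"
}

sact_tags 0x700063ff   # SAct value from the ata2.00 exception line
```

For 0x700063ff this gives tags 0 to 9, 13, 14 and 28 to 30, which are the same tags as the failed commands listed for ata2.00. In other words, every command that was in flight on that port was aborted at once, which fits a bus-level fault better than a single bad sector.
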
The same then happens with all the other SSDs.
The NVMe drive (on which Proxmox is installed) keeps working fine.
After a host restart everything works again.

I have the same problem on a few nodes (on the others Proxmox is installed on an SSD too), but only the disks with LVM-thin storage are lost.
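
Because the raw log is long, it can help to reduce it to one line per ATA port. The following is only a sketch: it assumes the syslog format quoted above, and the function name ata_summary is hypothetical.

```shell
# ata_summary: hypothetical helper, reads kernel log lines on stdin and prints,
# per ATA port, how many commands failed and whether the port was disabled.
ata_summary() {
    awk '
        # e.g. "ata2.00: failed command: WRITE FPDMA QUEUED"
        match($0, /ata[0-9]+\.[0-9]+: failed command/) {
            port = substr($0, RSTART, RLENGTH)
            sub(/\..*/, "", port)          # keep only "ataN"
            failed[port]++
            seen[port] = 1
        }
        # e.g. "ata2.00: disabled"
        match($0, /ata[0-9]+\.[0-9]+: disabled/) {
            port = substr($0, RSTART, RLENGTH)
            sub(/\..*/, "", port)
            disabled[port] = 1
            seen[port] = 1
        }
        END {
            for (p in seen) {
                d = (p in disabled) ? "yes" : "no"
                printf "%s failed=%d disabled=%s\n", p, failed[p] + 0, d
            }
        }
    ' | sort
}
```

It could be run as "ata_summary < /var/log/syslog" (the log path is an assumption about where these lines were collected). Comparing the output across the affected nodes would show whether the same ports drop out each time.
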


pveversion --verbose
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-7
pve-kernel-helper: 6.3-7
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-6
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-3
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2
 

ph0x

Active Member
Jul 5, 2020
Do you have some sort of HBA or RAID controller? If all disks are affected, maybe the controller is starting to die.
 

Aleksej

ph0x said: Do you have some sort of HBA or RAID controller?
No, it is a very simple host config. The NVMe is on M.2, and the remaining 4 SATA ports hold 1 HDD and 3 SSDs.
All SSDs are new (on one host Goodram CL100 and Crucial MX500; on the others 2x AMD Ryzen5 and Samsung QVO 870).

I found an interesting report here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1610622, but disabling the IOMMU has no effect...

Also, only the SSDs are "lost". The HDD and the NVMe stay online...

Apr 6 12:38:07 ps13 kernel: [362939.886004] blk_update_request: I/O error, dev sdb, sector 44168752 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 6 12:38:08 ps13 kernel: [362939.932921] sd 1:0:0:0: [sdb] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:38:08 ps13 kernel: [362939.932922] sd 1:0:0:0: [sdb] tag#20 CDB: Read(10) 28 00 02 a1 f6 30 00 00 08 00
Apr 6 12:38:08 ps13 kernel: [362939.932923] blk_update_request: I/O error, dev sdb, sector 44168752 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 6 12:38:08 ps13 kernel: [362940.229776] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:08 ps13 kernel: [362940.229781] Buffer I/O error on dev dm-20, logical block 2188930, async page read
Apr 6 12:38:08 ps13 kernel: [362940.230756] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:08 ps13 kernel: [362940.230762] Buffer I/O error on dev dm-20, logical block 2188937, async page read
Apr 6 12:38:09 ps13 kernel: [362941.245369] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:09 ps13 kernel: [362941.245373] Buffer I/O error on dev dm-20, logical block 2188937, async page read
Apr 6 12:38:10 ps13 kernel: [362942.261020] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:10 ps13 kernel: [362942.261024] Buffer I/O error on dev dm-20, logical block 2188937, async page read
Apr 6 12:38:11 ps13 kernel: [362943.276621] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:12 ps13 kernel: [362944.002508] ata1: softreset failed (1st FIS failed)
Apr 6 12:38:12 ps13 kernel: [362944.002516] ata1: limiting SATA link speed to 3.0 Gbps
Apr 6 12:38:12 ps13 kernel: [362944.002517] ata1: hard resetting link
Apr 6 12:38:12 ps13 kernel: [362944.002783] ata6: softreset failed (1st FIS failed)
Apr 6 12:38:12 ps13 kernel: [362944.002787] ata6: limiting SATA link speed to 3.0 Gbps
Apr 6 12:38:12 ps13 kernel: [362944.002788] ata6: hard resetting link
Apr 6 12:38:12 ps13 kernel: [362944.002960] ata5: softreset failed (1st FIS failed)
Apr 6 12:38:12 ps13 kernel: [362944.002962] ata5: limiting SATA link speed to 3.0 Gbps
Apr 6 12:38:12 ps13 kernel: [362944.002963] ata5: hard resetting link
Apr 6 12:38:12 ps13 kernel: [362944.276624] scsi_io_completion_action: 17 callbacks suppressed
Apr 6 12:38:12 ps13 kernel: [362944.276626] sd 1:0:0:0: [sdb] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:38:12 ps13 kernel: [362944.276627] sd 1:0:0:0: [sdb] tag#2 CDB: Read(10) 28 00 73 42 b5 28 00 00 08 00
Apr 6 12:38:12 ps13 kernel: [362944.276628] print_req_error: 17 callbacks suppressed
Apr 6 12:38:12 ps13 kernel: [362944.276629] blk_update_request: I/O error, dev sdb, sector 1933751592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 6 12:38:12 ps13 kernel: [362944.276637] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:12 ps13 kernel: [362944.276641] buffer_io_error: 1 callbacks suppressed
Apr 6 12:38:12 ps13 kernel: [362944.276642] Buffer I/O error on dev dm-20, logical block 2188937, async page read
Apr 6 12:38:12 ps13 kernel: [362944.276884] sd 1:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:38:12 ps13 kernel: [362944.276885] sd 1:0:0:0: [sdb] tag#3 CDB: Read(10) 28 00 73 42 b5 28 00 00 08 00

....


Apr 6 12:38:14 ps13 kernel: [362946.292254] blk_update_request: I/O error, dev sdb, sector 1933751592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 6 12:38:14 ps13 kernel: [362946.292262] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:14 ps13 kernel: [362946.292267] Buffer I/O error on dev dm-20, logical block 2188930, async page read
Apr 6 12:38:15 ps13 kernel: [362947.292263] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:15 ps13 kernel: [362947.292268] Buffer I/O error on dev dm-20, logical block 2188930, async page read
Apr 6 12:38:16 ps13 kernel: [362948.307882] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:16 ps13 kernel: [362948.307886] Buffer I/O error on dev dm-20, logical block 2188930, async page read
Apr 6 12:38:16 ps13 kernel: [362948.308048] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:16 ps13 kernel: [362948.308051] Buffer I/O error on dev dm-20, logical block 2188930, async page read
Apr 6 12:38:17 ps13 kernel: [362949.002751] ata1: softreset failed (1st FIS failed)
Apr 6 12:38:17 ps13 kernel: [362949.002758] ata1: reset failed, giving up
Apr 6 12:38:17 ps13 kernel: [362949.002760] ata1.00: disabled
Apr 6 12:38:17 ps13 kernel: [362949.002772] ata1: EH complete
Apr 6 12:38:17 ps13 kernel: [362949.002792] ata6: softreset failed (1st FIS failed)
Apr 6 12:38:17 ps13 kernel: [362949.002796] ata6: reset failed, giving up
Apr 6 12:38:17 ps13 kernel: [362949.002798] ata6.00: disabled
Apr 6 12:38:17 ps13 kernel: [362949.002809] ata6: EH complete
Apr 6 12:38:17 ps13 kernel: [362949.002811] ata5: softreset failed (1st FIS failed)
Apr 6 12:38:17 ps13 kernel: [362949.002815] ata5: reset failed, giving up
Apr 6 12:38:17 ps13 kernel: [362949.002817] ata5.00: disabled
Apr 6 12:38:17 ps13 kernel: [362949.002824] ata5: EH complete
Apr 6 12:38:17 ps13 kernel: [362949.323550] scsi_io_completion_action: 18 callbacks suppressed
Apr 6 12:38:17 ps13 kernel: [362949.323554] sd 1:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:38:17 ps13 kernel: [362949.323556] sd 1:0:0:0: [sdb] tag#17 CDB: Read(10) 28 00 73 42 b5 28 00 00 08 00
Apr 6 12:38:17 ps13 kernel: [362949.323557] print_req_error: 18 callbacks suppressed
Apr 6 12:38:17 ps13 kernel: [362949.323558] blk_update_request: I/O error, dev sdb, sector 1933751592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 6 12:38:17 ps13 kernel: [362949.323570] device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -5
Apr 6 12:38:17 ps13 kernel: [362949.323577] Buffer I/O error on dev dm-20, logical block 2188930, async page read
Apr 6 12:38:17 ps13 pvestatd[1656]: status update time (90.781 seconds)
Apr 6 12:38:17 ps13 kernel: [362949.482978] sd 0:0:0:0: [sda] tag#26 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:38:17 ps13 kernel: [362949.482981] sd 0:0:0:0: [sda] tag#26 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
Apr 6 12:38:17 ps13 kernel: [362949.482982] blk_update_request: I/O error, dev sda, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
Apr 6 12:38:17 ps13 kernel: [362949.483019] sd 0:0:0:0: [sda] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:38:17 ps13 kernel: [362949.483020] sd 0:0:0:0: [sda] tag#27 CDB: Read(10) 28 00 00 00 08 00 00 01 00 00
Apr 6 12:38:17 ps13 kernel: [362949.483020] blk_update_request: I/O error, dev sda, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
Apr 6 12:38:17 ps13 kernel: [362949.483115] sd 1:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 6 12:38:17 ps13 kernel: [362949.483116] sd 1:0:0:0: [sdb] tag#26 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00

As you can see, a bit later the same happens with another disk (/dev/sda).
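
The log mixes ATA port names (ata1, ata2, ...) with block device names (sda, sdb). One way to see which port each disk hangs off is to resolve the sysfs link of the block device. This is a sketch only; the helper name ata_port_of is hypothetical, and the exact sysfs layout may differ between kernels.

```shell
# ata_port_of PATH: hypothetical helper, prints the "ataN" component of a
# resolved sysfs device path, or nothing if the disk is not behind libata.
ata_port_of() {
    printf '%s\n' "$1" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}

for blk in /sys/block/sd*; do
    [ -e "$blk" ] || continue          # glob did not match any disk
    path=$(readlink -f "$blk")         # follow the symlink into /sys/devices
    printf '%s -> %s\n' "${blk##*/}" "$(ata_port_of "$path")"
done
```

The resolved path also contains the PCI address of the controller, so it shows whether a disk sits behind 02:00.1, the device named in the IO_PAGE_FAULT line.
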
 

avw

Active Member
May 31, 2020
the Netherlands
Looks like device 02:00.1 (probably a SATA controller) does something that the AMD IOMMU does not like and disables it? Can you show us your IOMMU groups using this command?

for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done

Are you using PCI passthrough with some of your VMs? Maybe adding iommu=pt (using identity mapping for devices that belong to the host) to the kernel command line can help?
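
For reference, a sketch of how iommu=pt could be added on a host that boots through GRUB, with a check afterwards that the running kernel really received it. The existing value "quiet" and the helper name has_kernel_param are assumptions for illustration.

```shell
# 1. In /etc/default/grub, append iommu=pt to the options already present
#    (the "quiet" shown here is an assumed existing value):
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"
# 2. Regenerate the GRUB configuration and reboot:
#      update-grub
# 3. After the reboot, confirm the parameter is on the kernel command line.

# has_kernel_param CMDLINE PARAM: hypothetical helper, succeeds if PARAM
# appears as a whole word in CMDLINE.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_kernel_param "$(cat /proc/cmdline 2>/dev/null)" "iommu=pt"; then
    echo "iommu=pt is active"
else
    echo "iommu=pt is not on the kernel command line"
fi
```

Hosts that boot through systemd-boot instead of GRUB keep their kernel command line elsewhere, so step 1 and 2 would differ there.
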
 

Aleksej

IOMMU group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 10 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 11 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU group 12 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
IOMMU group 12 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU group 13 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0 [1022:1440]
IOMMU group 13 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1 [1022:1441]
IOMMU group 13 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2 [1022:1442]
IOMMU group 13 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3 [1022:1443]
IOMMU group 13 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4 [1022:1444]
IOMMU group 13 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5 [1022:1445]
IOMMU group 13 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6 [1022:1446]
IOMMU group 13 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7 [1022:1447]
IOMMU group 14 01:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device [126f:2263] (rev 03)
IOMMU group 15 02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01)
IOMMU group 15 02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
IOMMU group 15 02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)
IOMMU group 15 03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU group 15 03:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU group 15 03:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
IOMMU group 15 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
IOMMU group 16 07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce GT 710B] [10de:128b] (rev a1)
IOMMU group 16 07:00.1 Audio device [0403]: NVIDIA Corporation GK208 HDMI/DP Audio Controller [10de:0e0f] (rev a1)
IOMMU group 17 08:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU group 18 09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU group 19 09:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
IOMMU group 1 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU group 20 09:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
IOMMU group 21 09:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]
IOMMU group 2 00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU group 3 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 4 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 5 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU group 6 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 7 00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 8 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 9 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]

Actually, I do not need the IOMMU at all. I'm not passing devices through to VMs. I tried to pass through the GPU, but it does not work well with this motherboard, so I removed "amd_iommu=on" from grub.
Also, there is nothing connected to the USB ports.
 

avw

Then I have no idea. Maybe try not to use the USB Controller 02:00.0; try to move USB-devices to the other USB controller ports. Maybe adding iommu=pt will help (the IOMMU exists, even when you don't do passthrough). Maybe another (older) BIOS version will help? What CPU are you using? Sometimes older BIOS versions are more stable with older CPUs.
 

Aleksej

There are no USB devices at all.
It is a host with only power and network cables connected))))

I tried to disable the IOMMU in the BIOS. Still no effect.
The worst thing is that I cannot see what causes this... It works fine, then at some moment it happens, and there is nothing in the logs.
 

Aleksej

Replacing it is a bit harder, but I have 3 hosts.
They have different power supply units and CPUs (Ryz 7 3900X, Ryz 9 3950X, Ryz 9 5900), and all of them have the same problem.
The hosts are not heavily loaded (average CPU usage 10%, memory not higher than 60%).
 

avw

Three similar but not identical setups having the exact same problem is unlikely. Maybe there is a common factor causing trouble: power quality from the electric grid or network switch. Maybe you can try replacing the switch and/or use other outlets/power groups, maybe a UPS for one of the machines? Interesting puzzle but I have no real clues.
 

Aleksej

avw said: maybe a UPS for one of the machines?
It seems you are right, and it is very interesting. I have 2 hosts connected to one UPS. I reviewed the power situation and yes, there was a switch to the UPS for half a second. The whole machine keeps working, and only the SSDs feel it...
Now I have permanently disabled the IOMMU in the BIOS and will watch it for longer. For a few days everything has been OK with the IOMMU disabled.
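
To confirm that the BIOS setting really took effect, one can look at /sys/kernel/iommu_groups, which normally has no entries when the IOMMU is not active. A small sketch, with a made-up helper name count_iommu_groups:

```shell
# count_iommu_groups DIR: hypothetical helper, prints the number of entries
# in DIR (normally /sys/kernel/iommu_groups); 0 if DIR is missing or empty.
count_iommu_groups() {
    n=0
    for g in "$1"/*; do
        if [ -e "$g" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

group_count=$(count_iommu_groups /sys/kernel/iommu_groups)
if [ "$group_count" -eq 0 ]; then
    echo "no IOMMU groups: the IOMMU does not appear to be active"
else
    echo "$group_count IOMMU groups present: the IOMMU is still active"
fi
```

If groups are still listed after the change, the IO_PAGE_FAULT path seen in the first log is still possible, and a quiet few days would say more about the power supply than about the IOMMU.
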
 
