Problem with SSDs

z3r0bug

Member
Apr 12, 2021
Denmark/Germany
Hello Proxmox Forum,

Ever since I installed Proxmox I have had a persistent problem: after some number of days my SSDs apparently go offline or drop the connection, and my entire Proxmox setup crashes. The SSDs are installed normally, connected directly to the SATA ports of the server that runs Proxmox.

Please help with some ideas.

These are the errors:

Jun 01 12:16:20 orilla kernel: ata2.00: exception Emask 0x10 SAct 0x7f827ff8 SErr 0x0 action 0x6 frozen
Jun 01 12:16:20 orilla kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:18:00:9a:fc/00:00:0a:00:00/40 tag 3 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:20:18:9a:fc/00:00:0a:00:00/40 tag 4 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/10:28:20:9b:fc/00:00:0a:00:00/40 tag 5 ncq dma 8192 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:30:f0:9b:fc/00:00:0a:00:00/40 tag 6 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:38:78:f3:fc/00:00:0a:00:00/40 tag 7 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:40:08:26:fd/00:00:0a:00:00/40 tag 8 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/18:48:70:9c:fc/00:00:0a:00:00/40 tag 9 ncq dma 12288 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/10:50:48:9d:fc/00:00:0a:00:00/40 tag 10 ncq dma 8192 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:58:f8:9d:fc/00:00:0a:00:00/40 tag 11 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:60:58:9e:fc/00:00:0a:00:00/40 tag 12 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:68:c8:9e:fc/00:00:0a:00:00/40 tag 13 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/10:70:b8:9f:fc/00:00:0a:00:00/40 tag 14 ncq dma 8192 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/10:88:90:dc:2b/00:00:01:00:00/40 tag 17 ncq dma 8192 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:b8:00:73:fc/00:00:0a:00:00/40 tag 23 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:c0:68:73:fc/00:00:0a:00:00/40 tag 24 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:c8:68:98:fc/00:00:0a:00:00/40 tag 25 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:d0:a8:98:fc/00:00:0a:00:00/40 tag 26 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:d8:e0:98:fc/00:00:0a:00:00/40 tag 27 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:e0:70:99:fc/00:00:0a:00:00/40 tag 28 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:e8:c0:99:fc/00:00:0a:00:00/40 tag 29 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:f0:e8:99:fc/00:00:0a:00:00/40 tag 30 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2: hard resetting link
Jun 01 12:16:21 orilla kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 01 12:16:21 orilla kernel: ata2.00: supports DRM functions and may not be fully accessible
Jun 01 12:16:21 orilla kernel: ata2.00: supports DRM functions and may not be fully accessible
Jun 01 12:16:21 orilla kernel: ata2.00: configured for UDMA/133
Jun 01 12:16:21 orilla kernel: ata2: EH complete
Jun 01 12:16:21 orilla kernel: ata2.00: Enabling discard_zeroes_data
 
What SSDs are those? Are they healthy, i.e. what are the S.M.A.R.T. values?
Are they system disks, VM image disks, or both?
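For anyone who wants to post SMART values as text rather than screenshots, here is a minimal Python sketch. It assumes smartmontools 7.0 or newer, whose `smartctl --json -a <device>` output carries an `ata_smart_attributes.table` list. The watched attribute IDs, the helper names and the sample values are illustrative assumptions, not data from this thread, and the attribute names can differ between vendors.

```python
import json
import subprocess

# Attribute IDs that are often worth a look on SATA SSDs (assumption: the
# exact set and the names vary by vendor and model).
WATCHED_IDS = (5, 177, 199)


def read_smart_json(device: str) -> str:
    """Run smartctl and return its JSON report as text (needs root)."""
    # No check=True: smartctl uses non-zero exit bits even for valid reports.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
    )
    return result.stdout


def pick_attributes(smart_json: str, watched=WATCHED_IDS) -> dict:
    """Return {id: (name, normalized value, raw value)} for watched IDs."""
    report = json.loads(smart_json)
    table = report.get("ata_smart_attributes", {}).get("table", [])
    picked = {}
    for row in table:
        if row.get("id") in watched:
            raw_value = row.get("raw", {}).get("value")
            picked[row["id"]] = (row.get("name"), row.get("value"), raw_value)
    return picked


# Made-up sample in the shape described above, so the parser can be tried
# without a real drive.
SAMPLE = json.dumps({
    "ata_smart_attributes": {
        "table": [
            {"id": 9, "name": "Power_On_Hours", "value": 99, "raw": {"value": 1400}},
            {"id": 177, "name": "Wear_Leveling_Count", "value": 99, "raw": {"value": 3}},
            {"id": 199, "name": "CRC_Error_Count", "value": 100, "raw": {"value": 0}},
        ]
    }
})

for attr_id, (name, value, raw) in sorted(pick_attributes(SAMPLE).items()):
    print(attr_id, name, value, raw)
```

On a real host the input would come from something like `pick_attributes(read_smart_json("/dev/sda"))`, where the device path is a placeholder.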
 
Both of them are Samsung 870 QVO, and they should be healthy since they have only been running for a few days. They are used only for VM image disks. The Proxmox OS runs on another SSD.
Here are some screenshots with the SMART values.
Attachments: Screenshot_1.png, Screenshot_2.png
 
How many images are actively running on the disks?
You should either get appropriate SSDs (enterprise-grade MLC) or keep an eye on the wearout, since these are QLC disks and will wear out in quite a short time.
 
3-4 VMs and 9 CTs, so not a lot. I know that QLC drives are cheap SSDs, but they have only been running for about 2 months. That is a short period, even for QLC, I think.
 
Well, those are 13 concurrent sources of disk operations. The problem right now might be the IOPS rather than the wearout, but the wearout will come pretty quickly.
I'd start with fewer VMs/CTs and monitor whether the error occurs with a specific number of machines.
As for the SSDs, the Samsung SM/PM series are a good start.
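To make that kind of monitoring easier, a small script can count the ATA errors per day and per port, so the numbers can be compared against how many guests were running. This is only a sketch: it assumes log lines in the same format as the excerpt in the first post, and the `summarize` helper and its event labels are made up for illustration.

```python
import re
from collections import Counter

# Matches lines like:
#   Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
LINE = re.compile(
    r"^(?P<day>\w{3} \d{2}) \d{2}:\d{2}:\d{2} \S+ kernel: "
    r"(?P<port>ata\d+)(?:\.\d+)?: (?P<msg>.*)$"
)


def summarize(lines):
    """Count failed commands and link resets per (day, port, event)."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # e.g. the "res ..." continuation lines
        day, port, msg = match["day"], match["port"], match["msg"]
        if msg.startswith("failed command:"):
            counts[(day, port, "failed command")] += 1
        elif "hard resetting link" in msg:
            counts[(day, port, "link reset")] += 1
    return counts


# A few lines copied from the log in the first post.
SAMPLE = """\
Jun 01 12:16:20 orilla kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jun 01 12:16:20 orilla kernel: ata2.00: cmd 61/08:18:00:9a:fc/00:00:0a:00:00/40 tag 3 ncq dma 4096 out
res 40/00:70:b8:9f:fc/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 01 12:16:20 orilla kernel: ata2.00: status: { DRDY }
Jun 01 12:16:20 orilla kernel: ata2: hard resetting link
""".splitlines()

for key, count in sorted(summarize(SAMPLE).items()):
    print(key, count)
```

Feeding it the saved kernel log after each change in the number of running VMs/CTs would show whether the error count tracks the load.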
 
Don't underestimate the write amplification caused by sync writes, virtualization and ZFS. Depending on your workload you can kill them within weeks if you really want to. But SMART reports only 1% wearout, so that shouldn't be the problem right now.

Did you flash the newest firmware onto the drives?

Are you sure they are not connected to SATA ports that offer some kind of hardware RAID, if you are using ZFS as software RAID?

As they are complaining about DMA errors, did you run memtest86 to verify that your RAM is fine?
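As a rough illustration of the wearout point, the two figures mentioned in this thread (1% wearout after about 2 months) can be extrapolated linearly. This is only a back-of-the-envelope sketch in Python: it assumes wear grows linearly with time, which real drives need not follow, and the `load_factor` parameter is a hypothetical knob for "what if the write load were N times higher".

```python
def months_until_worn_out(wear_percent: float,
                          months_in_service: float,
                          load_factor: float = 1.0) -> float:
    """Linearly extrapolate the remaining months until 100% wearout.

    wear_percent      -- wearout reported so far (e.g. from SMART)
    months_in_service -- how long the drive took to reach that wearout
    load_factor       -- hypothetical multiplier on the future write load
    """
    if wear_percent <= 0 or months_in_service <= 0 or load_factor <= 0:
        raise ValueError("all inputs must be positive")
    wear_per_month = wear_percent / months_in_service * load_factor
    return (100.0 - wear_percent) / wear_per_month


# Figures from the thread: 1% wearout after about 2 months.
print(months_until_worn_out(1.0, 2.0))
# Same drive if the write load were ten times higher (hypothetical).
print(months_until_worn_out(1.0, 2.0, load_factor=10.0))
```

At the current rate the estimate is far beyond any realistic service life, which fits the remark that wearout is not the problem right now. A much heavier sync-write workload shrinks the estimate proportionally.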
 
Yes, I am sure; the SATA ports do not provide any kind of RAID at all. And the RAM should not be the problem: I ran a 48-hour test with no errors. I might replace the disks with new ones, though I am not sure which ones yet.
 
