storage keeps dying? please help.

Karl0ss · Jan 13, 2023

I have been running Proxmox 7.3-4 for the last month. At the moment it's a single node, running on a brand new ZimaBoard (832) with a one-month-old 256 GB SSD. Everything had been running fine, but this morning I woke up to everything (the UI and all machines running on the node) being unreachable. After a reboot the node and machines came back up, but then died not long after. While the UI was open, I noticed that local-lvm and local had vanished. Surely this SSD can't be dying after one month's usage? I have attached the log; hopefully something makes sense to someone and I can get some help.

Thanks

Karl0ss · Jan 13, 2023

Just to update this: after rebooting a few times I was able to turn off all my autoboots, so there are no VMs or LXCs running on the node, but the system still dies after 2-5 minutes of running. Here is an updated log. Like I said, this is a new (Christmas Day) SSD, and when I ran a SMART check on it after a reboot, there were no errors...

shanreich · Jan 13, 2023

Have you checked the cabling of the disk? Check if everything is plugged in properly and, if the problem persists, try a different cable.

Karl0ss · Jan 13, 2023

Thank you for your response, shanreich.

Yeah, I have done this in the last hour: new data and power cables, but still the same issue after a few minutes. Everything had been running fine for a few weeks, so I really don't understand what is going on. Like I say, it also happens after a few minutes with no machines running, so it doesn't appear to be load-related either.

Karl0ss · Jan 13, 2023

It looks like a cron job runs, and then it dies. I have just switched the drive to a different SATA port, and the same issue occurred...

I'm starting to think the drive must have failed? I assume it's not the ZimaBoard, as that is also brand new.

Code:

Jan 13 16:17:01 zimavm1 CRON[3492]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 13 16:17:01 zimavm1 CRON[3493]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 13 16:17:01 zimavm1 CRON[3492]: pam_unix(cron:session): session closed for user root
Jan 13 16:18:45 zimavm1 kernel: ata1.00: exception Emask 0x0 SAct 0x4000e0 SErr 0x50000 action 0x6 frozen
Jan 13 16:18:45 zimavm1 kernel: ata1: SError: { PHYRdyChg CommWake }
Jan 13 16:18:45 zimavm1 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 13 16:18:45 zimavm1 kernel: ata1.00: cmd 61/10:28:20:90:08/00:00:01:00:00/40 tag 5 ncq dma 8192 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 13 16:18:45 zimavm1 kernel: ata1.00: status: { DRDY }
Jan 13 16:18:45 zimavm1 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 13 16:18:45 zimavm1 kernel: ata1.00: cmd 61/98:30:b8:bf:f3/01:00:01:00:00/40 tag 6 ncq dma 208896 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 13 16:18:45 zimavm1 kernel: ata1.00: status: { DRDY }
Jan 13 16:18:45 zimavm1 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 13 16:18:45 zimavm1 kernel: ata1.00: cmd 61/08:38:10:c0:a0/00:00:02:00:00/40 tag 7 ncq dma 4096 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 13 16:18:45 zimavm1 kernel: ata1.00: status: { DRDY }
Jan 13 16:18:45 zimavm1 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jan 13 16:18:45 zimavm1 kernel: ata1.00: cmd 60/00:b0:00:08:10/01:00:00:00:00/40 tag 22 ncq dma 131072 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 13 16:18:45 zimavm1 kernel: ata1.00: status: { DRDY }
Jan 13 16:18:45 zimavm1 kernel: ata1: hard resetting link
Jan 13 16:19:45 zimavm1 kernel: ata1: softreset failed (1st FIS failed)
Jan 13 16:19:45 zimavm1 kernel: ata1: hard resetting link
Jan 13 16:19:45 zimavm1 kernel: ata1: softreset failed (1st FIS failed)
Jan 13 16:19:45 zimavm1 kernel: ata1: hard resetting link
Jan 13 16:19:45 zimavm1 kernel: ata1: softreset failed (1st FIS failed)
Jan 13 16:19:45 zimavm1 kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 13 16:19:45 zimavm1 kernel: ata1: hard resetting link
Jan 13 16:19:45 zimavm1 kernel: ata1: softreset failed (device not ready)
Jan 13 16:19:45 zimavm1 kernel: ata1: reset failed, giving up
Jan 13 16:19:45 zimavm1 kernel: ata1.00: disabled
Jan 13 16:19:45 zimavm1 kernel: ata1: EH complete
Jan 13 16:19:45 zimavm1 kernel: blk_update_request: I/O error, dev sda, sector 44089360 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jan 13 16:19:45 zimavm1 kernel: sd 0:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
Jan 13 16:19:45 zimavm1 kernel: sd 0:0:0:0: [sda] tag#2 CDB: Read(10) 28 00 00 10 08 00 00 01 00 00
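
To compare runs after each cable or port swap, the libata messages in a log like the one above can be reduced to a per-port tally. This is a minimal Python sketch, assuming syslog-style lines in the same format as the excerpt; the function name `summarize_ata_errors` and the field names are illustrative and not part of any Proxmox or kernel tooling.

```python
import re

# Matches kernel lines such as "... kernel: ata1.00: failed command: ..."
# and captures the port name (ata1) and the message text.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_errors(lines):
    """Tally libata error events per ATA port (illustrative helper)."""
    summary = {}
    for line in lines:
        match = ATA_LINE.search(line)
        if match is None:
            continue  # not a libata message (cron, continuation lines, ...)
        port, message = match.group(1), match.group(2).strip()
        stats = summary.setdefault(port, {
            "failed_commands": 0,   # "failed command: ..." lines
            "reset_failures": 0,    # "softreset failed (...)" lines
            "gave_up": False,       # "reset failed, giving up"
            "disabled": False,      # device dropped by the kernel
        })
        if message.startswith("failed command:"):
            stats["failed_commands"] += 1
        elif message.startswith("softreset failed"):
            stats["reset_failures"] += 1
        elif message.startswith("reset failed, giving up"):
            stats["gave_up"] = True
        elif message == "disabled":
            stats["disabled"] = True
    return summary
```

Passing the lines of a saved log file to `summarize_ata_errors` returns one entry per port. If a different port shows the same pattern of reset failures followed by `disabled` after the drive is moved, that suggests the fault follows the drive, cable, or power rather than one specific SATA port.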

shanreich · Jan 16, 2023

Definitely screams hardware issue of some kind to me (either drive or mobo). Hard to tell from afar whether it is the drive or the motherboard.
You could try using the drive on another machine with a different mobo, see if it works there. Or try using another drive with the same mobo, see if that works.

Upgrading the firmware of the disk might work, although that is unlikely, since it had been working for a few weeks before the problems started appearing.

What's the exact output of smartctl?

Code:

smartctl -a /dev/sdX
