Hi,
I need help. This is the second time my Proxmox has died spontaneously for no apparent reason. Last time, the "solution" was to upgrade to 7.0 (Last issue).
I still have SSH access, so there is still some means to debug or find a log. Please help, I don't know where to start looking. It has only been running two VMs, with no particular load on either of them. This time the filesystem is mounted read-only.
As I can't log into it, all I can say for now is that it is on the latest 7.0, patched yesterday after it seemed to be OK. This morning, however, it is not.
Code:
Aug 11 01:17:01 pve CRON[1599807]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 11 02:17:01 pve CRON[1611263]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 11 02:50:19 pve kernel: [493854.313196] ata1.00: exception Emask 0x10 SAct 0x8000000 SErr 0x4040000 action 0xe frozen
Aug 11 02:50:19 pve kernel: [493854.313221] ata1.00: irq_stat 0x00000040, connection status changed
Aug 11 02:50:19 pve kernel: [493854.313223] ata1: SError: { CommWake DevExch }
Aug 11 02:50:19 pve kernel: [493854.313227] ata1.00: failed command: WRITE FPDMA QUEUED
Aug 11 02:50:19 pve kernel: [493854.313229] ata1.00: cmd 61/10:d8:f0:88:9b/00:00:04:00:00/40 tag 27 ncq dma 8192 out
Aug 11 02:50:19 pve kernel: [493854.313229] res 40/00:d4:00:08:10/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Aug 11 02:50:19 pve kernel: [493854.313236] ata1.00: status: { DRDY }
Aug 11 02:50:19 pve kernel: [493854.313240] ata1: hard resetting link
Aug 11 02:50:19 pve kernel: [493855.025137] ata1: SATA link down (SStatus 0 SControl 300)
Aug 11 02:50:19 pve kernel: [493855.052681] ata1: hard resetting link
Aug 11 02:50:20 pve kernel: [493855.425892] ata1: SATA link down (SStatus 0 SControl 300)
Aug 11 02:50:20 pve kernel: [493855.564677] ata1: hard resetting link
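To see at a glance how often the link drops, the kernel lines can be tallied per port and message. This is only a minimal sketch, assuming the log lines look like the excerpt above; the name `summarize_ata_events` and the regular expression are my own and not part of any Proxmox tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "Aug 11 02:50:19 pve kernel: [493854.313240] ata1: hard resetting link"
# Assumption: the "[uptime] ataN[.NN]: message" layout seen in the excerpt.
ATA_LINE = re.compile(r"kernel: \[\s*\d+\.\d+\] (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(lines):
    """Count kernel messages per (ATA port, message) pair."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            port, message = match.groups()
            counts[(port, message.strip())] += 1
    return counts


# A few lines copied from the log excerpt.
sample = [
    "Aug 11 02:17:01 pve CRON[1611263]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)",
    "Aug 11 02:50:19 pve kernel: [493854.313240] ata1: hard resetting link",
    "Aug 11 02:50:19 pve kernel: [493855.025137] ata1: SATA link down (SStatus 0 SControl 300)",
    "Aug 11 02:50:19 pve kernel: [493855.052681] ata1: hard resetting link",
    "Aug 11 02:50:20 pve kernel: [493855.425892] ata1: SATA link down (SStatus 0 SControl 300)",
    "Aug 11 02:50:20 pve kernel: [493855.564677] ata1: hard resetting link",
]

for (port, message), n in summarize_ata_events(sample).most_common():
    print(f"{n}x {port}: {message}")
```

Repeated "hard resetting link" followed by "SATA link down" on the same port suggests the drive keeps dropping off the bus, which fits either a bad cable or a dying drive.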
Findings:
- Is my drive failing?
- Found this thread, where it seems the same issue was due to a bad cable...
- HDD temp has been ~70-80 C for a long time... I have no idea why it is so hot; I just assumed that was normal. I have now discovered that > 50 C is considered too hot... I probably cooked the drive.
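To keep an eye on the temperature in future, the SMART attribute table can be parsed and compared against the 50 C mark. This is a rough sketch, assuming the drive reports its temperature as SMART attribute 194 and that the text comes from `smartctl -A` in the usual ten-column layout; the function names and the default limit are my own choices.

```python
import re

TOO_HOT_C = 50  # the "> 50 C is too hot" rule of thumb from the findings


def parse_temperature_c(smartctl_output):
    """Return the raw value of SMART attribute 194 in degrees C, or None.

    Assumption: attribute rows have ten whitespace-separated columns
    (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw).
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "194":
            # The raw value may carry extras such as "(Min/Max ...)";
            # only the leading integer is used.
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def is_too_hot(temp_c, limit_c=TOO_HOT_C):
    """True if a temperature was found and it exceeds the limit."""
    return temp_c is not None and temp_c > limit_c
```

The text to parse would come from running `smartctl -A` against the drive (the device path depends on the system). Some drives report temperature under a different attribute ID, in which case `parse_temperature_c` returns `None`.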
Finally opened it up and blew some dust out of the fan. Also removed the HDD, which is probably at least somewhat defective, along with its metal brackets and cables, and I feel pretty good about having taken a backup as recently as yesterday. Will do a reinstall from scratch on a new M.2 drive. Hopefully this was a heat issue and is now resolved once and for all...