[SOLVED] Again, Proxmox not loading WebUI, VMs dead

LooneyTunes · Aug 11, 2021

Hi,

I need help. This is the second time my Proxmox has died spontaneously for no apparent reason. Last time, the "solution" was to upgrade to 7.0 (see: Last issue).

I still have SSH access, so there is still some way to debug or find a log. Please help, I don't know where to start looking. It has only been running two VMs, with no particular load on either of them. This time the filesystem is mounted read-only.

As I can't log into it, all I can say for now is that it is the latest 7.0, patched yesterday, after which it seemed to be OK. This morning, however, it is not.

Code:

Aug 11 01:17:01 pve CRON[1599807]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 11 02:17:01 pve CRON[1611263]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 11 02:50:19 pve kernel: [493854.313196] ata1.00: exception Emask 0x10 SAct 0x8000000 SErr 0x4040000 action 0xe frozen
Aug 11 02:50:19 pve kernel: [493854.313221] ata1.00: irq_stat 0x00000040, connection status changed
Aug 11 02:50:19 pve kernel: [493854.313223] ata1: SError: { CommWake DevExch }
Aug 11 02:50:19 pve kernel: [493854.313227] ata1.00: failed command: WRITE FPDMA QUEUED
Aug 11 02:50:19 pve kernel: [493854.313229] ata1.00: cmd 61/10:d8:f0:88:9b/00:00:04:00:00/40 tag 27 ncq dma 8192 out
Aug 11 02:50:19 pve kernel: [493854.313229]          res 40/00:d4:00:08:10/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Aug 11 02:50:19 pve kernel: [493854.313236] ata1.00: status: { DRDY }
Aug 11 02:50:19 pve kernel: [493854.313240] ata1: hard resetting link
Aug 11 02:50:19 pve kernel: [493855.025137] ata1: SATA link down (SStatus 0 SControl 300)
Aug 11 02:50:19 pve kernel: [493855.052681] ata1: hard resetting link
Aug 11 02:50:20 pve kernel: [493855.425892] ata1: SATA link down (SStatus 0 SControl 300)
Aug 11 02:50:20 pve kernel: [493855.564677] ata1: hard resetting link

Findings:
- Is my drive failing?
- Found this thread where it seems the same issue was due to a bad cable...
- HDD temp has been ~70-80 C for a long time... I have no idea why it is so hot, I just assumed that was normal. I have now discovered that > 50 C is considered too hot... I probably cooked the drive

Finally opened it up and blew out some dust from the fan. Also removed the probably at least somewhat defective HDD, along with the metal brackets and cables, and feel pretty good about having taken a backup as recently as yesterday. Will do a reinstall from scratch on a new M.2 drive. Hopefully this was a heat issue that is now resolved once and for all...
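For anyone who still has SSH access in the same situation, the read-only state mentioned above can be confirmed from /proc/mounts. A minimal Python sketch, assuming a Linux host; the function name read_only_mounts is made up for illustration:

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point, fstype) for each mount whose options include 'ro'.

    mounts_text is the content of /proc/mounts, one mount per line:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        # Compare whole options so 'errors=remount-ro' is not mistaken for 'ro'.
        if "ro" in options.split(","):
            result.append((device, mount_point, fstype))
    return result

# Usage on the host (reads the live mount table):
#   with open("/proc/mounts") as f:
#       for device, mount_point, fstype in read_only_mounts(f.read()):
#           print(f"{mount_point} ({device}, {fstype}) is mounted read-only")
```

Some pseudo-filesystems are normally mounted read-only, so only an unexpected entry such as the root filesystem would point to a problem.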

fabian · Aug 11, 2021

yeah, messages like that indicate a HW problem - cables, connectors, disk..
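To see how often a port throws such messages, the syslog can be summarized per ATA port. A rough Python sketch, assuming the log lines look like the excerpt in the first post; the pattern list and the function name summarize_ata_events are illustrative, not an official tool:

```python
import re

# Matches kernel lines such as:
# "Aug 11 02:50:19 pve kernel: [493854.313240] ata1: hard resetting link"
ATA_LINE = re.compile(r"kernel: \[\s*\d+\.\d+\] (ata\d+(?:\.\d+)?): (.*)$")

# Substrings taken from the log excerpt in this thread (not an exhaustive list).
SUSPICIOUS = (
    "exception Emask",
    "failed command",
    "hard resetting link",
    "SATA link down",
)

def summarize_ata_events(lines):
    """Count suspicious ATA messages per port, e.g. {'ata1': {'SATA link down': 2}}."""
    counts = {}
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port = match.group(1).split(".")[0]  # 'ata1.00' -> 'ata1'
        message = match.group(2)
        for needle in SUSPICIOUS:
            if needle in message:
                per_port = counts.setdefault(port, {})
                per_port[needle] = per_port.get(needle, 0) + 1
    return counts

# Usage (the log path is an assumption, adjust for your system):
#   with open("/var/log/syslog") as f:
#       print(summarize_ata_events(f))
```

Repeated link resets on a single port point at that port's cable, connector, or disk, but the counts alone cannot tell these apart.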

LooneyTunes · Aug 11, 2021

fabian said:
yeah, messages like that indicate a HW problem - cables, connectors, disk..

Thanks for confirming. I will start over, but I can't find any way to verify the download, i.e. MD5 sums or SHA files?

fabian · Aug 11, 2021

which download?

LooneyTunes · Aug 11, 2021

fabian said:
which download?

"proxmox-ve_7.0-1.iso"

fabian · Aug 11, 2021

yeah, the download page contains the checksum:

https://proxmox.com/en/downloads/item/proxmox-ve-7-0-iso-installer

Code:

SHA256SUMS for the ISO:

ae38bcb5ecc9aa97f6b13b89689fc4e876f9535f738bc0be4ffa4924274f25d9
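One way to compare a downloaded ISO against the checksum above, sketched in Python with only the standard library. The helper names and the local file path are assumptions for illustration:

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large ISO is never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected_hex):
    """Return True if the file's SHA-256 equals the published checksum."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Usage (assumes the ISO sits in the current directory):
#   expected = "ae38bcb5ecc9aa97f6b13b89689fc4e876f9535f738bc0be4ffa4924274f25d9"
#   print(matches_checksum("proxmox-ve_7.0-1.iso", expected))
```

If the result is False, the download is corrupt or incomplete and should be fetched again before installing.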

LooneyTunes · Aug 11, 2021

fabian said:
yeah, the download page contains the checksum:

https://proxmox.com/en/downloads/item/proxmox-ve-7-0-iso-installer

Code:

SHA256SUMS for the ISO: ae38bcb5ecc9aa97f6b13b89689fc4e876f9535f738bc0be4ffa4924274f25d9

Oh! Figures, I was so surprised not to find it. Never occurred to me to click on the heading, sorry for that and thanks!

hr556 · Aug 19, 2021

LooneyTunes said:
Oh! Figures, I was so surprised not to find it. Never occurred to me to click on the heading, sorry for that and thanks!

Any updates at your front? I'm encountering similar issues and also thinking about heat problems...

LooneyTunes · Aug 19, 2021

daniel.tremmel said:
Any updates at your front? I'm encountering similar issues and also thinking about heat problems...

Well, my old HDD finally gave up, so I had to replace it, but after reinstalling on a new disk I haven't had any crashes like these (and hope not to). I also learned I had read the disk's SMART report wrong, so I thought I still had heat issues, whereas the disk was in fact only at 34 C.
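The post does not say how the report was misread. One common mix-up in `smartctl -A` output is taking the normalized VALUE column of attribute 194 for degrees, when the temperature is usually in RAW_VALUE. A small Python sketch under that assumption; the sample line in the comment is illustrative and not taken from the actual disk:

```python
def parse_smart_temperature(smartctl_output):
    """Return (normalized_value, raw_celsius) for SMART attribute 194, or None.

    Expects the attribute table printed by `smartctl -A`, with columns:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    Illustrative line (not from the thread):
    194 Temperature_Celsius 0x0022 066 045 000 Old_age Always - 34 (Min/Max 20/55)
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) < 10 or fields[0] != "194":
            continue
        normalized = int(fields[3])   # vendor-scaled health value, NOT degrees
        raw_celsius = int(fields[9])  # first token of RAW_VALUE, usually degrees C
        return normalized, raw_celsius
    return None
```

Some drives report the temperature under a different attribute or pack extra data into the raw value, so treat the result as a hint and check it against the full smartctl report.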

hazaki · Aug 19, 2021

I had the same problem today. The motherboard is a Supermicro H11SSL-I and the CPU is an AMD 7401P. I changed all cables and hard drives, but the problem persisted. Later I learned from the AMD forum that it is a bug and the solution is to avoid using the SATA 0-3 interfaces. After switching to the SATA #4 interface, the problem was solved. I hope this helps people using the same platform.

hr556 · Aug 20, 2021

So, I think I've found the issue, or at least the faulty part. I had one single VM (Windows Server 2019) running on this host (Intel NUC, Proxmox v7.0-11). As soon as I stopped that VM or migrated it away from the host, everything ran smoothly, including adding new VMs/containers and running them for hours. The faulty VM would not be up for more than an hour or so before everything crashed and I had to restart the host by pulling the power cable.

So for everyone having the same issue: try stopping VMs running on that host and see if this helps. I think this might be some weird hard disk issue triggered by that single VM, but I'm not quite sure. Nothing found in the logs.

LooneyTunes · Aug 23, 2021

Closing this thread as my issue was due to a faulty HDD which has been replaced.
