[SOLVED] Again, Proxmox not loading WebUI, VMs dead

LooneyTunes

Hi,

I need help. This is the second time my Proxmox has died spontaneously for no apparent reason. Last time, the "solution" was to upgrade to 7.0: Last issue

I still have SSH access, so I can still debug or pull logs. Please help, I don't know where to start looking. The host has only been running two VMs, with no particular load on either of them. This time the filesystem is mounted read-only.
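
For anyone landing here with only SSH access: a minimal sketch of one way to confirm which filesystems have gone read-only, by parsing /proc/mounts. The function name and the sample entries are hypothetical and only for illustration; they are not taken from the affected host.

```python
import os


def read_only_mounts(mounts_text):
    """Return (mount_point, fs_type) for every entry whose options include 'ro'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Options are comma separated; "errors=remount-ro" must not count as "ro".
        if "ro" in options.split(","):
            found.append((mount_point, fs_type))
    return found


# Illustrative sample only (made-up entries in /proc/mounts format).
SAMPLE = """\
/dev/mapper/pve-root / ext4 ro,relatime,errors=remount-ro 0 0
tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime 0 0
"""

if __name__ == "__main__":
    path = "/proc/mounts"
    text = open(path).read() if os.path.exists(path) else SAMPLE
    for mount_point, fs_type in read_only_mounts(text):
        print(f"{mount_point} ({fs_type}) is mounted read-only")
```

Some pseudo filesystems are normally mounted read-only, so the interesting entries are the real disks, such as the root filesystem. An ext4 root with errors=remount-ro that shows up as ro usually means the kernel hit I/O errors and remounted it.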

As I can't log into the WebUI, all I can say for now is that it is on the latest 7.0, patched yesterday, after which it seemed to be OK. This morning, however, it is not.

Code:
Aug 11 01:17:01 pve CRON[1599807]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 11 02:17:01 pve CRON[1611263]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 11 02:50:19 pve kernel: [493854.313196] ata1.00: exception Emask 0x10 SAct 0x8000000 SErr 0x4040000 action 0xe frozen
Aug 11 02:50:19 pve kernel: [493854.313221] ata1.00: irq_stat 0x00000040, connection status changed
Aug 11 02:50:19 pve kernel: [493854.313223] ata1: SError: { CommWake DevExch }
Aug 11 02:50:19 pve kernel: [493854.313227] ata1.00: failed command: WRITE FPDMA QUEUED
Aug 11 02:50:19 pve kernel: [493854.313229] ata1.00: cmd 61/10:d8:f0:88:9b/00:00:04:00:00/40 tag 27 ncq dma 8192 out
Aug 11 02:50:19 pve kernel: [493854.313229]          res 40/00:d4:00:08:10/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Aug 11 02:50:19 pve kernel: [493854.313236] ata1.00: status: { DRDY }
Aug 11 02:50:19 pve kernel: [493854.313240] ata1: hard resetting link
Aug 11 02:50:19 pve kernel: [493855.025137] ata1: SATA link down (SStatus 0 SControl 300)
Aug 11 02:50:19 pve kernel: [493855.052681] ata1: hard resetting link
Aug 11 02:50:20 pve kernel: [493855.425892] ata1: SATA link down (SStatus 0 SControl 300)
Aug 11 02:50:20 pve kernel: [493855.564677] ata1: hard resetting link

Findings:
- Is my drive failing?
- Found this thread where it seems the same issue was due to a bad cable...
- HDD temp has been ~70-80 C for a long time... I have no idea why it is so hot, I just assumed that was normal. I have now discovered that > 50 C is considered too hot... I probably cooked the drive :(

Finally opened it up and blew the dust out of the fan. Also removed the HDD, which is probably at least somewhat defective, along with its metal brackets and cables, and I feel pretty good about having taken a backup as recently as yesterday. Will do a reinstall from scratch on a new M.2 drive. Hopefully this was a heat issue and is now resolved once and for all...
 
yeah, messages like that indicate a HW problem - cables, connectors, disk..
 
which download?
 
Oh! Figures, I was so surprised not to find it. Never occurred to me to click on the heading, sorry for that and thanks!
Any updates at your front? I'm encountering similar issues and also thinking about heat problems...
 
Any updates at your front? I'm encountering similar issues and also thinking about heat problems...
Well, my old HDD finally gave up, so I had to replace it, but after reinstalling on a new disk I haven't had any crashes like these (and hope not to). I also learned I had read the disk's SMART report wrong: I thought I still had heat issues, whereas the disk was in fact at merely 34 C.
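
The post does not say how the SMART report was misread. One common confusion with `smartctl -A` output is taking the normalized VALUE column for the temperature, when the reading in degrees Celsius is normally in RAW_VALUE. Below is a minimal sketch that pulls both numbers from attribute 194 so they can be compared side by side. The sample row is made up for illustration, and the raw-value layout can differ between drive vendors, so treat this as an assumption rather than a general parser.

```python
def parse_temperature(smartctl_output):
    """Extract normalized and raw values of SMART attribute 194 from `smartctl -A` text."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE...
        if len(fields) >= 10 and fields[0] == "194":
            return {
                "attribute": fields[1],
                "normalized": int(fields[3]),   # vendor-specific score, not degrees
                "raw_celsius": int(fields[9]),  # first token of RAW_VALUE
            }
    return None


# Illustrative sample only; the numbers are hypothetical, not from the poster's disk.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7421
194 Temperature_Celsius     0x0022   066   052   000    Old_age   Always       -       34 (Min/Max 20/48)
"""

if __name__ == "__main__":
    reading = parse_temperature(SAMPLE)
    if reading is not None:
        print(f"{reading['attribute']}: raw {reading['raw_celsius']} C, "
              f"normalized value {reading['normalized']}")
```

To try it on a real disk, save the output of `smartctl -A /dev/sdX` to a file and pass its contents to the function.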
 
I had the same problem today. The motherboard is a Supermicro H11SSL-I and the CPU is an AMD 7401P. I changed all cables and hard drives, but the problem persisted. Later I learned from the AMD forum that it is a bug and that the solution is to avoid using the SATA 0-3 interfaces. After switching to the SATA #4 interface, the problem was solved. I hope this helps people using the same platform.
 
So, I think I've found the issue, or at least the faulty part. I had a single VM (Windows Server 2019) running on this host (Intel NUC, Proxmox v7.0-11). As soon as I stopped that VM or migrated it away from the host, everything ran smoothly, including adding new VMs/containers and running them for hours. The faulty VM would not even be up for an hour or so before everything crashed and I had to restart the host by pulling out the power cable.

So for everyone having the same issue: try stopping VMs running on that host and see if this helps. I think this might be some weird hard disk issue triggered by that single VM, but I'm not quite sure. Nothing found in the logs.
 
