PVE crashes regularly

Rnr-vienna

Apr 25, 2023
My PVE (running in a homelab on desktop hardware) keeps crashing. At first it crashed every two months or so, but recently it crashed on two of the last three Sundays at exactly the same time.
I retrieved some information from the syslog via the web interface:

Crash 1:
Apr 09 06:47:01 pve CRON[1164866]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 09 06:47:01 pve CRON[1164867]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ))
Apr 09 06:47:01 pve CRON[1164866]: pam_unix(cron:session): session closed for user root

Crash 2:
Apr 23 06:47:01 pve CRON[2051669]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 23 06:47:01 pve CRON[2051670]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ))
Apr 23 06:47:01 pve CRON[2051669]: pam_unix(cron:session): session closed for user root
Apr 23 06:48:57 pve kernel: ata5.00: exception Emask 0x0 SAct 0x1ffc001 SErr 0x40000 action 0x6 frozen
Apr 23 06:48:57 pve kernel: ata5: SError: { CommWake }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/10:00:c0:e2:ed/00:00:0e:00:00/40 tag 0 ncq dma 8192 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:70:f8:f0:9c/00:00:0d:00:00/40 tag 14 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:78:b8:b3:9d/00:00:0d:00:00/40 tag 15 ncq dma 4096 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:80:c0:87:c3/00:00:0d:00:00/40 tag 16 ncq dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:88:e8:b7:c3/00:00:0d:00:00/40 tag 17 ncq dma 4096 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:90:68:2d:ec/00:00:0e:00:00/40 tag 18 ncq dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:98:18:3a:ec/00:00:0e:00:00/40 tag 19 ncq dma 4096 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:a0:70:e2:ee/00:00:0e:00:00/40 tag 20 ncq dma 4096 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:a8:70:2f:f2/00:00:0e:00:00/40 tag 21 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:b0:78:da:f0/00:00:0e:00:00/40 tag 22 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/10:b8:b0:08:f2/00:00:0e:00:00/40 tag 23 ncq dma 8192 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:c0:30:84:f2/00:00:0e:00:00/40 tag 24 ncq dma 4096 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: status: { DRDY }
Apr 23 06:48:57 pve kernel: ata5: hard resetting link
Apr 23 06:48:57 pve kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 23 06:48:57 pve kernel: ata5.00: configured for UDMA/133
Apr 23 06:48:57 pve kernel: ahci 0000:00:17.0: port does not support device sleep
Apr 23 06:48:58 pve kernel: ata5: EH complete
Apr 23 06:48:58 pve postfix/qmgr[1200]: 2039E34146B: from=<root@pve.local>, size=10154, nrcpt=1 (queue active)
Apr 23 06:48:58 pve pve-firewall[1253]: firewall update time (12.299 seconds)
Apr 23 06:48:58 pve pvestatd[1255]: status update time (31.185 seconds)
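For reading the error burst: the `SAct` value in the `exception` line is the SActive register, with one bit per outstanding NCQ tag. Decoding `0x1ffc001` gives tags 0 and 14 to 24, i.e. exactly the twelve WRITE FPDMA QUEUED commands that timed out before the link was hard-reset. The Python sketch below does that decoding and tallies the failed commands in a pasted excerpt. The regular expressions are written against the line format shown in this log and are an assumption, not a complete libata log parser; `SAMPLE` is just a short copy of the lines above.

```python
import re
from collections import Counter

# Patterns written against the libata lines in this syslog excerpt (assumed format).
FAILED_RE = re.compile(r"(ata\d+\.\d+): failed command: (.+)$")
CMD_RE = re.compile(r"cmd \S+ tag (\d+) ncq dma (\d+) (?:in|out)")
RESET_RE = re.compile(r"(ata\d+): hard resetting link")


def sact_tags(sact: int) -> list:
    """Return the NCQ tags whose bit is set in an SActive register value."""
    return [bit for bit in range(32) if (sact >> bit) & 1]


def summarize(log_text: str) -> dict:
    """Tally failed commands, their tags and sizes, timeouts and link resets."""
    commands = Counter()  # (device, command) -> number of failures
    tags = []             # NCQ tag of each failed command
    total_bytes = 0       # sum of the DMA transfer sizes
    timeouts = 0          # result lines flagged "Emask 0x4 (timeout)"
    resets = Counter()    # port -> number of hard link resets
    for line in log_text.splitlines():
        if m := FAILED_RE.search(line):
            commands[(m.group(1), m.group(2).strip())] += 1
        elif m := CMD_RE.search(line):
            tags.append(int(m.group(1)))
            total_bytes += int(m.group(2))
        elif m := RESET_RE.search(line):
            resets[m.group(1)] += 1
        if "Emask 0x4 (timeout)" in line:
            timeouts += 1
    return {"commands": commands, "tags": tags, "bytes": total_bytes,
            "timeouts": timeouts, "resets": resets}


# Short excerpt copied from the log above.
SAMPLE = """\
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/10:00:c0:e2:ed/00:00:0e:00:00/40 tag 0 ncq dma 8192 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 23 06:48:57 pve kernel: ata5.00: cmd 61/08:70:f8:f0:9c/00:00:0d:00:00/40 tag 14 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 23 06:48:57 pve kernel: ata5: hard resetting link
"""

if __name__ == "__main__":
    print(sact_tags(0x1ffc001))  # [0, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
    print(summarize(SAMPLE))
```

Run on the full Crash 2 excerpt, the tally should list twelve failed writes on `ata5.00`, all with a timeout result, followed by a single hard reset of `ata5`. That pattern (the drive stops answering queued writes and the link is reset) points at the drive or its connection rather than at a particular cron job.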


I am looking for some help identifying the problem.
I could not find out which job runs every Sunday at 6:47 am. Another PVE instance, belonging to a friend of mine, has the same entry at the same time.
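The CRON lines logged just before both crashes are the weekly `run-parts` entry from `/etc/crontab`. Debian's default crontab, which PVE inherits, schedules it as `47 6 * * 7`, i.e. 06:47 every Sunday, which would also explain why a second PVE host logs the same line at the same time. The minimal Python sketch below checks both crash timestamps against that schedule; it only understands `*` and plain numbers, and the year 2023 is assumed from the post date. To see which scripts that entry actually runs on a given host, `run-parts --test /etc/cron.weekly` lists them without executing them.

```python
from datetime import datetime

# Assumption: the weekly entry in /etc/crontab is the Debian default below
# (minute hour day-of-month month day-of-week).
WEEKLY_ENTRY = "47 6 * * 7"


def field_matches(field: str, value: int) -> bool:
    """Match a single cron field; only '*' and plain integers are supported."""
    return field == "*" or int(field) == value


def cron_matches(expr: str, when: datetime) -> bool:
    """Simplified cron match; ignores ranges, lists, steps and the special
    OR rule cron applies when both day-of-month and day-of-week are set."""
    minute, hour, dom, month, dow = expr.split()
    # cron uses 0 or 7 for Sunday; isoweekday() returns 1 (Mon) to 7 (Sun)
    dow_ok = dow == "*" or int(dow) % 7 == when.isoweekday() % 7
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(dom, when.day)
            and field_matches(month, when.month)
            and dow_ok)


if __name__ == "__main__":
    # CRON timestamps logged right before each crash (year assumed to be 2023)
    for ts in (datetime(2023, 4, 9, 6, 47), datetime(2023, 4, 23, 6, 47)):
        print(ts.isoformat(), cron_matches(WEEKLY_ENTRY, ts))
```

Both dates fall on a Sunday, so both match. The weekly job itself is probably only the trigger: whatever it writes to disk would then hit the same failing drive.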

PVE is installed on an M.2 SSD, which is most likely the drive producing all these errors.
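To confirm that `ata5` really is the M.2 SSD and not another disk in the system, one option is to resolve the sysfs link of each disk: for SATA disks behind an AHCI controller (such as the `ahci 0000:00:17.0` in the log), the resolved path of `/sys/block/sdX` contains an `ataN` component. The Python sketch below assumes that path layout and lists every `sd*` device with its ATA port; the example path in the comment (`host4`, `sda`) is hypothetical.

```python
import glob
import os
import re

# Assumption: a resolved sysfs path for an AHCI SATA disk looks like
# /sys/devices/pci0000:00/0000:00:17.0/ata5/host4/target4:0:0/4:0:0:0/block/sda
ATA_PORT_RE = re.compile(r"/(ata\d+)/")


def ata_port_of(sysfs_path: str):
    """Return the 'ataN' component of a resolved sysfs path, or None."""
    m = ATA_PORT_RE.search(sysfs_path)
    return m.group(1) if m else None


def map_disks_to_ports() -> dict:
    """Map each /sys/block/sd* device name to its ATA port (None if not ATA)."""
    mapping = {}
    for link in sorted(glob.glob("/sys/block/sd*")):
        mapping[os.path.basename(link)] = ata_port_of(os.path.realpath(link))
    return mapping


if __name__ == "__main__":
    for disk, port in map_disks_to_ports().items():
        print(f"/dev/{disk}: {port or 'not an ATA device'}")
```

Once the device name is known, its SMART data (for example with `smartctl -a /dev/sdX`) is the next thing to look at.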


Thanks for any help!
 
Is your M.2 SSD using the SATA interface (and not NVMe)? It looks like it is failing, so anything stored on it could be corrupted, which could explain the crashes because the software may end up executing garbage instructions. It could also be a failing memory module, which causes all kinds of corruption (and also tends to get worse over time) and may already have affected recent backups. It could even be the power supply, but I suggest testing the memory and reinstalling Proxmox on another drive.
 
It is a SATA M.2 SSD. A second SSD (2.5" SATA) is also installed in the system, but it is only used as storage for my PBS, which runs as a VM on PVE.

I already ran MemTest86+: 4 passes, no errors.
 
