Server fails to boot

gdi2k

We had one of our Proxmox VE servers crash today, and it wouldn't boot back up.

I've attached screenshots. Hard disk related? We have 2 hard disks in RAID1, so I feel it's unlikely that both would have failed.

The cluster did its job and the VMs were moved to the remaining servers automatically, so no major downtime. :)
 

Attachments

  • WhatsApp Image 2017-09-02 at 17.43.42.jpeg (137 KB)
  • WhatsApp Image 2017-09-02 at 17.49.34.jpeg (159.2 KB)
Hi,

I would boot a live CD to check whether your mainboard is working properly.

The screenshots show both your network and your disk failing.
If that is the case, a broken mainboard is a possible cause.
 
This looks to be hard disk related, specifically the first hard disk (system disk).

The server is installed with a RAID1 config across sda and sdb (using the Proxmox VE installer). I switched the boot drive to sdb and it runs successfully, but it still tends to crash again after a few hours or days.

I am seeing a lot of these errors in dmesg:

Code:
[   10.764968] ata1.00: exception Emask 0x10 SAct 0x78 SErr 0x280100 action 0x6 frozen
[   10.764993] ata1.00: irq_stat 0x08000000, interface fatal error
[   10.765008] ata1: SError: { UnrecovData 10B8B BadCRC }
[   10.765022] ata1.00: failed command: READ FPDMA QUEUED
[   10.765036] ata1.00: cmd 60/98:18:60:9e:d8/00:00:02:00:00/40 tag 3 ncq 77824 in
         res 40/00:28:c0:51:f2/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[   10.765069] ata1.00: status: { DRDY }
[   10.765079] ata1.00: failed command: READ FPDMA QUEUED
[   10.765092] ata1.00: cmd 60/00:20:a0:9c:ec/01:00:00:00:00/40 tag 4 ncq 131072 in
         res 40/00:28:c0:51:f2/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[   10.765124] ata1.00: status: { DRDY }
[   10.765134] ata1.00: failed command: READ FPDMA QUEUED
[   10.765147] ata1.00: cmd 60/c0:28:c0:51:f2/00:00:02:00:00/40 tag 5 ncq 98304 in
         res 40/00:28:c0:51:f2/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[   10.765179] ata1.00: status: { DRDY }
[   10.765473] ata1.00: failed command: READ FPDMA QUEUED
[   10.765763] ata1.00: cmd 60/20:30:58:5e:b8/00:00:02:00:00/40 tag 6 ncq 16384 in
         res 40/00:28:c0:51:f2/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[   10.766343] ata1.00: status: { DRDY }
[   10.766639] ata1: hard resetting link
[   11.085003] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   11.086629] ata1.00: supports DRM functions and may not be fully accessible
[   11.086885] ata1.00: disabling queued TRIM support
[   11.087432] ata1.00: supports DRM functions and may not be fully accessible
[   11.087611] ata1.00: disabling queued TRIM support
[   11.087926] ata1.00: configured for UDMA/133
[   11.087934] ata1: EH complete
[   11.116938] ata1.00: exception Emask 0x10 SAct 0x60000000 SErr 0x280100 action 0x6 frozen
[   11.117266] ata1.00: irq_stat 0x08000000, interface fatal error
[   11.117599] ata1: SError: { UnrecovData 10B8B BadCRC }
[   11.117931] ata1.00: failed command: READ FPDMA QUEUED
[   11.118271] ata1.00: cmd 60/00:e8:a0:3b:b8/01:00:02:00:00/40 tag 29 ncq 131072 in
         res 40/00:e8:a0:3b:b8/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[   11.118983] ata1.00: status: { DRDY }
[   11.119348] ata1.00: failed command: READ FPDMA QUEUED
[   11.119723] ata1.00: cmd 60/00:f0:a0:3c:b8/01:00:02:00:00/40 tag 30 ncq 131072 in
         res 40/00:e8:a0:3b:b8/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[   11.120548] ata1.00: status: { DRDY }
[   11.121012] ata1: hard resetting link
[   11.444933] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   11.446586] ata1.00: supports DRM functions and may not be fully accessible
[   11.446871] ata1.00: disabling queued TRIM support
[   11.447438] ata1.00: supports DRM functions and may not be fully accessible
[   11.447617] ata1.00: disabling queued TRIM support
[   11.447947] ata1.00: configured for UDMA/133
[   11.447954] ata1: EH complete
[   11.460435] SGI XFS with ACLs, security attributes, realtime, no debug enabled
[   11.463566] XFS (sdc1): Mounting V4 Filesystem
[   11.476903] ata1: limiting SATA link speed to 3.0 Gbps
[   11.476906] ata1.00: exception Emask 0x10 SAct 0x8000000 SErr 0x280100 action 0x6 frozen
[   11.477374] ata1.00: irq_stat 0x08000000, interface fatal error
[   11.477873] ata1: SError: { UnrecovData 10B8B BadCRC }
[   11.478320] ata1.00: failed command: READ FPDMA QUEUED
[   11.478768] ata1.00: cmd 60/f0:d8:d8:4b:f2/00:00:02:00:00/40 tag 27 ncq 122880 in
         res 40/00:d8:d8:4b:f2/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[   11.479652] ata1.00: status: { DRDY }
[   11.480098] ata1: hard resetting link
[   11.485963] XFS (sdc1): Starting recovery (logdev: internal)
[   11.493679] XFS (sdc1): Ending recovery (logdev: internal)
[   11.796903] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[   11.798510] ata1.00: supports DRM functions and may not be fully accessible
[   11.798772] ata1.00: disabling queued TRIM support
[   11.799405] ata1.00: supports DRM functions and may not be fully accessible
[   11.799614] ata1.00: disabling queued TRIM support
[   11.799980] ata1.00: configured for UDMA/133
[   11.799986] ata1: EH complete
[   12.017924] systemd-sysv-generator[2802]: Ignoring creation of an alias umountiscsi.service for itself
[   12.053838] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   12.056605] systemd-sysv-generator[2839]: Ignoring creation of an alias umountiscsi.service for itself
[   12.069394] ip_set: protocol 6
[   12.088613] systemd-sysv-generator[2854]: Ignoring creation of an alias umountiscsi.service for itself
[   12.132898] XFS (sdb1): Mounting V4 Filesystem
[   12.148120] XFS (sdb1): Starting recovery (logdev: internal)
[   12.164455] XFS (sdb1): Ending recovery (logdev: internal)
[   12.258317] systemd-sysv-generator[2889]: Ignoring creation of an alias umountiscsi.service for itself
[   12.289102] systemd-sysv-generator[2901]: Ignoring creation of an alias umountiscsi.service for itself
[   12.319213] systemd-sysv-generator[2913]: Ignoring creation of an alias umountiscsi.service for itself
[   32.746598] XFS (sdd1): Mounting V4 Filesystem
[   32.763609] XFS (sdd1): Starting recovery (logdev: internal)
[   32.781957] XFS (sdd1): Ending recovery (logdev: internal)
[   32.876777] systemd-sysv-generator[3292]: Ignoring creation of an alias umountiscsi.service for itself
[   32.908547] systemd-sysv-generator[3304]: Ignoring creation of an alias umountiscsi.service for itself
[   32.944236] systemd-sysv-generator[3316]: Ignoring creation of an alias umountiscsi.service for itself
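
A log like the one above is easier to read once the repeated libata messages are counted per port. The following is a minimal Python sketch, not a diagnostic tool: the function name `summarize_ata_errors` and the `SAMPLE` excerpt are made up for illustration, and the regular expressions only cover the four message types that appear in this log. As a general observation, `BadCRC` and `10B8B` are link-level flags, so a pattern like this is often worth checking against the cable, backplane, or port before concluding that the disk itself has failed.

```python
import re

# Patterns for the libata messages seen in the log above.
EXCEPTION_RE = re.compile(r"\b(ata\d+)\.\d+: exception Emask")
SERROR_RE = re.compile(r"\b(ata\d+): SError: \{([^}]*)\}")
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")
SPEED_RE = re.compile(r"\b(ata\d+): limiting SATA link speed to ([\d.]+ Gbps)")


def summarize_ata_errors(dmesg_text):
    """Return {port: {"exceptions", "resets", "serror_flags", "speed_limits"}}."""
    summary = {}

    def entry(port):
        # Create the per-port record on first use.
        return summary.setdefault(
            port,
            {"exceptions": 0, "resets": 0, "serror_flags": {}, "speed_limits": []},
        )

    for line in dmesg_text.splitlines():
        m = EXCEPTION_RE.search(line)
        if m:
            entry(m.group(1))["exceptions"] += 1
            continue
        m = SERROR_RE.search(line)
        if m:
            flags = entry(m.group(1))["serror_flags"]
            for flag in m.group(2).split():
                flags[flag] = flags.get(flag, 0) + 1
            continue
        m = RESET_RE.search(line)
        if m:
            entry(m.group(1))["resets"] += 1
            continue
        m = SPEED_RE.search(line)
        if m:
            entry(m.group(1))["speed_limits"].append(m.group(2))
    return summary


# Hypothetical excerpt: four lines copied from the log above.
SAMPLE = """\
[   10.764968] ata1.00: exception Emask 0x10 SAct 0x78 SErr 0x280100 action 0x6 frozen
[   10.765008] ata1: SError: { UnrecovData 10B8B BadCRC }
[   10.766639] ata1: hard resetting link
[   11.476903] ata1: limiting SATA link speed to 3.0 Gbps
"""

if __name__ == "__main__":
    for port, stats in summarize_ata_errors(SAMPLE).items():
        print(port, stats)
```

To use it on a real machine, the output of `dmesg` could be saved to a file and its contents passed to the function as a string.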

So I would like to replace the sda drive. What is the procedure for this, given the RAID1 setup from the installer? I would like to avoid reinstalling PVE from scratch if possible.
 
I have a similar error. My Proxmox server starts to freeze, then I lose connectivity and have to reboot it manually.

When I start the node again, I get these messages:

# dmesg | grep ata8
[ 8.097820] ata8: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200180 irq 68
[ 8.411662] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 8.412367] ata8.00: supports DRM functions and may not be fully accessible
[ 8.412392] ata8.00: ATA-10: CT500MX500SSD1, M3CR020, max UDMA/133
[ 8.412394] ata8.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[ 8.412533] ata8.00: READ LOG DMA EXT failed, trying PIO
[ 8.412534] ata8.00: failed to get Identify Device Data, Emask 0x40
[ 8.412535] ata8.00: ATA Identify Device Log not supported
[ 8.412536] ata8.00: Security Log not supported
[ 8.412538] ata8.00: failed to set xfermode (err_mask=0x40)
[ 13.855233] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 13.856024] ata8.00: supports DRM functions and may not be fully accessible
[ 13.856632] ata8.00: supports DRM functions and may not be fully accessible
[ 13.857113] ata8.00: configured for UDMA/133
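
The `SStatus` and `SControl` values in the "SATA link up" lines are hexadecimal register dumps, and decoding them shows the negotiated link speed and any speed cap the kernel has applied. Below is a minimal Python sketch, assuming the standard SATA register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). The function names `decode_sata_register` and `describe_link` are made up for illustration.

```python
import re

# SPD field values: 1 = Gen1, 2 = Gen2, 3 = Gen3.
# A value of 0 means "not negotiated" in SStatus and "no limit" in SControl.
SPEEDS = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

LINK_RE = re.compile(r"SStatus ([0-9A-Fa-f]+) SControl ([0-9A-Fa-f]+)")


def decode_sata_register(hex_value):
    """Split a hex SStatus/SControl value into its DET, SPD and IPM fields."""
    value = int(hex_value, 16)
    return {
        "det": value & 0xF,         # device detection
        "spd": (value >> 4) & 0xF,  # speed (negotiated or allowed)
        "ipm": (value >> 8) & 0xF,  # interface power management
    }


def describe_link(line):
    """Summarize one 'SATA link up' dmesg line, or return None if it does not match."""
    m = LINK_RE.search(line)
    if m is None:
        return None
    sstatus = decode_sata_register(m.group(1))
    scontrol = decode_sata_register(m.group(2))
    return {
        # DET == 3 means a device is present and communication is established.
        "device_present": sstatus["det"] == 3,
        "negotiated_speed": SPEEDS.get(sstatus["spd"]),
        # None here means the host has not restricted the link speed.
        "speed_limit": SPEEDS.get(scontrol["spd"]),
    }


if __name__ == "__main__":
    print(describe_link("ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)"))
```

With `SStatus 133 SControl 300` the link runs at 6.0 Gbps with no limit set. A line reporting `SStatus 123 SControl 320` would instead indicate a link capped at 3.0 Gbps, which the kernel does after repeated link errors.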


[ 6.695827] pci 0000:01:00.0: BAR 7: failed to assign [mem size 0x00100000 64bit]
[ 6.695831] pci 0000:01:00.0: BAR 10: failed to assign [mem size 0x00100000 64bit]
[ 6.695834] pci 0000:01:00.1: BAR 7: failed to assign [mem size 0x00100000 64bit]
[ 6.695837] pci 0000:01:00.1: BAR 10: failed to assign [mem size 0x00100000 64bit]
[ 8.412533] ata8.00: READ LOG DMA EXT failed, trying PIO
[ 8.412534] ata8.00: failed to get Identify Device Data, Emask 0x40
[ 8.412538] ata8.00: failed to set xfermode (err_mask=0x40)


# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
ceph: 12.2.8-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1



Could somebody please help me understand the issue?

Thanks