[SOLVED] It seemed failed disks were preventing Proxmox from booting (but it was actually "noapic" that was needed)

lifeboy

Renowned Member
I have a ProLiant server with 12 drives that are oldish, but it serves my purpose. Recently a prolonged power interruption took down my 4-node cluster. Three nodes came back up and Ceph is running on them, so I'm operational. However, the 4th node doesn't boot.

Code:
# pveversion --verbose
proxmox-ve: 4.4-107 (running kernel: 4.4.98-6-pve)
pve-manager: 4.4-22 (running version: 4.4-22/2728f613)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.83-1-pve: 4.4.83-96
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-54
qemu-server: 4.0-115
pve-firmware: 1.1-11
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.1-9~pve4
pve-container: 1.0-104
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.10-1~bpo80+1

The machine has 12 disk drives, two of which have failed due to this power failure (and old age!), so I removed them. Two of the other disks have simply lost their partition table info. I used fdisk to put it back (I have multiple disks of the same type; see the sketch after the log below), but the disks are still not being recognised properly. None of this is really a problem, since I can simply reformat those disks once the server is up again and Ceph will recover to a healthy state. However, the machine only gets as far as the detection of the disks:

Code:
Apr 16 06:26:25 h1 kernel: [2045524.471298] cciss 0000:0a:00.0: cmd ffff880036200280 has CHECK CONDITION sense key = 0x3
Apr 16 06:26:25 h1 kernel: [2045524.471317] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 16 06:26:25 h1 kernel: [2045524.480237] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 16 06:26:25 h1 kernel: [2045524.489259] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 17 06:26:14 h1 kernel: [2131913.679640] cciss 0000:0a:00.0: cmd ffff880036200280 has CHECK CONDITION sense key = 0x3
Apr 17 06:26:14 h1 kernel: [2131913.679663] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 17 06:26:14 h1 kernel: [2131913.688791] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 17 06:26:14 h1 kernel: [2131913.697996] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 18 06:26:34 h1 kernel: [2218333.280576] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 18 06:26:34 h1 kernel: [2218333.280595] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 18 06:26:34 h1 kernel: [2218333.289681] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 18 06:26:34 h1 kernel: [2218333.298807] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 19 06:26:18 h1 kernel: [2304717.118475] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 19 06:26:18 h1 kernel: [2304717.118496] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 19 06:26:18 h1 kernel: [2304717.127631] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 19 06:26:18 h1 kernel: [2304717.136714] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 20 06:26:41 h1 kernel: [2391140.161101] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 20 06:26:41 h1 kernel: [2391140.161119] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 20 06:26:41 h1 kernel: [2391140.170358] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 20 06:26:41 h1 kernel: [2391140.179496] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 21 06:26:40 h1 kernel: [2477539.747648] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 21 06:26:40 h1 kernel: [2477539.747669] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 21 06:26:40 h1 kernel: [2477539.756860] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 21 06:26:40 h1 kernel: [2477539.766071] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
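
For reference, on the partition-table repair mentioned above: since the drives are identical models, the table can essentially be copied from a healthy twin. A rough sketch with sfdisk (the device names c0d5 and c0d7 are only placeholders for a known-good source and the damaged target; review the dump before writing it anywhere):

Code:
# dump the partition table of a known-good, identical disk
sfdisk -d /dev/cciss/c0d5 > c0d5-parttable.dump
# sanity-check the dump before applying it
cat c0d5-parttable.dump
# write the same layout to the disk that lost its partition table
sfdisk /dev/cciss/c0d7 < c0d5-parttable.dump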

Without going into lots of detail, I'd simply like to know: how can I tell Proxmox to ignore the disk errors and continue booting, please?
 
The dmesg output indicates that the disk behind cciss/c0d11 (maybe slot 11 on the server, but I cannot say this for sure; hpacucli or a similar tool could help you check) gets recognized and the XFS filesystem on it gets mounted; however, the disk seems to be failing.

Maybe they didn't just lose their partition tables, but are actually broken as well.

Does the server boot eventually (do you get a login prompt), or is this the last entry you get in the log/dmesg?

Could you please post the output before the XFS errors?
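
If hpacucli is available for the Smart Array controller, something along these lines should map the logical drive to a physical bay (a sketch; slot=0 is just an example, use the controller slot reported by the first command):

Code:
# list the controller configuration, including logical and physical drives
hpacucli ctrl all show config
# show physical drive details (bay, model, status) for the controller in slot 0
hpacucli ctrl slot=0 pd all show detail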
 
The dmesg output indicates that the disk behind cciss/c0d11 (maybe slot 11 on the server, but I cannot say this for sure; hpacucli or a similar tool could help you check) gets recognized and the XFS filesystem on it gets mounted; however, the disk seems to be failing.

Maybe they didn't just lose their partition tables, but are actually broken as well.

Does the server boot eventually (do you get a login prompt), or is this the last entry you get in the log/dmesg?
No, it just hangs there. I have already removed cciss/c0d11, but the system is still looking for it...
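
A quick way to see where the removed disk is still being waited on is to check the node's configuration from the live CD; a leftover /etc/fstab entry can then be commented out or marked nofail so the boot no longer blocks on the missing device. A sketch, assuming the node's root filesystem is mounted under /mnt (the OSD mount point shown is hypothetical):

Code:
# from the live CD, with the node's root filesystem mounted under /mnt
grep "cciss/c0d11" /mnt/etc/fstab
# if an entry turns up, comment it out or add nofail, e.g.:
# /dev/cciss/c0d11p1  /var/lib/ceph/osd/ceph-11  xfs  defaults,nofail  0  2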

Could you please post the output before the XFS errors?

That is all that's in the log file. I have to boot from a live CD to access the logs and there's very little in the logs as you can see.

I have decided to trash the node and reinstall it. I wanted to learn what to do, but time is not on my side.
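
As the thread title notes, the eventual fix turned out to be the "noapic" kernel parameter rather than anything disk-related. For reference, adding a kernel parameter on a standard Proxmox/Debian install with GRUB looks roughly like this (a sketch, assuming the default GRUB setup):

Code:
# edit the kernel command line
nano /etc/default/grub
# change the default line to something like:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet noapic"
# then regenerate the GRUB config and reboot
update-grub
reboot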