I have an HP ProLiant server with 12 drives. They're oldish, but the machine serves my purpose. Recently a prolonged power interruption took down my 4-node cluster. Three nodes came back up and Ceph is running on those three nodes, so I'm operational. However, the 4th server doesn't boot.
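For reference, this is how I confirmed that the surviving nodes are still serving (just the standard status command, nothing fancy):
Code:
# run on any of the three surviving nodes; shows monitor quorum, OSD count and PG states
ceph -s
For completeness, here is my version info: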
Code:
# pveversion --verbose
proxmox-ve: 4.4-107 (running kernel: 4.4.98-6-pve)
pve-manager: 4.4-22 (running version: 4.4-22/2728f613)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.83-1-pve: 4.4.83-96
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-54
qemu-server: 4.0-115
pve-firmware: 1.1-11
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.1-9~pve4
pve-container: 1.0-104
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.10-1~bpo80+1
The machine has 12 disk drives, two of which failed outright due to this power failure (and old age!), so I removed them. Two of the other disks have simply lost their partition table info. I used fdisk to put it back, since I have multiple disks of the same type to copy the layout from (a rough sketch of the procedure follows after the log below), but those disks are still not being recognised properly. None of this is a real problem: I can simply reformat the disks once the server is up again and Ceph will recover to a healthy state. However, the machine gets stuck at the detection of the disks:
Code:
Apr 16 06:26:25 h1 kernel: [2045524.471298] cciss 0000:0a:00.0: cmd ffff880036200280 has CHECK CONDITION sense key = 0x3
Apr 16 06:26:25 h1 kernel: [2045524.471317] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 16 06:26:25 h1 kernel: [2045524.480237] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 16 06:26:25 h1 kernel: [2045524.489259] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 17 06:26:14 h1 kernel: [2131913.679640] cciss 0000:0a:00.0: cmd ffff880036200280 has CHECK CONDITION sense key = 0x3
Apr 17 06:26:14 h1 kernel: [2131913.679663] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 17 06:26:14 h1 kernel: [2131913.688791] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 17 06:26:14 h1 kernel: [2131913.697996] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 18 06:26:34 h1 kernel: [2218333.280576] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 18 06:26:34 h1 kernel: [2218333.280595] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 18 06:26:34 h1 kernel: [2218333.289681] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 18 06:26:34 h1 kernel: [2218333.298807] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 19 06:26:18 h1 kernel: [2304717.118475] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 19 06:26:18 h1 kernel: [2304717.118496] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 19 06:26:18 h1 kernel: [2304717.127631] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 19 06:26:18 h1 kernel: [2304717.136714] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 20 06:26:41 h1 kernel: [2391140.161101] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 20 06:26:41 h1 kernel: [2391140.161119] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 20 06:26:41 h1 kernel: [2391140.170358] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 20 06:26:41 h1 kernel: [2391140.179496] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Apr 21 06:26:40 h1 kernel: [2477539.747648] cciss 0000:0a:00.0: cmd ffff880036200000 has CHECK CONDITION sense key = 0x3
Apr 21 06:26:40 h1 kernel: [2477539.747669] blk_update_request: I/O error, dev cciss/c0d11, sector 738760832
Apr 21 06:26:40 h1 kernel: [2477539.756860] XFS (cciss/c0d11p1): metadata I/O error: block 0x2b689080 ("xfs_trans_read_buf_map") error 5 numblks 16
Apr 21 06:26:40 h1 kernel: [2477539.766071] XFS (cciss/c0d11p1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
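As promised above, here is roughly the scripted equivalent of what I did by hand in fdisk to restore the lost partition tables. The device names are purely illustrative (c0d4 standing in for a healthy disk of the same model, c0d5 for a disk that lost its table):
Code:
# dump the partition layout of a known-good identical disk (c0d4 is an assumption)
sfdisk -d /dev/cciss/c0d4 > layout.dump
# write that same layout onto the disk that lost its partition table (c0d5 here)
sfdisk /dev/cciss/c0d5 < layout.dump
# ask the kernel to re-read the new partition table
partprobe /dev/cciss/c0d5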
Without going into lots of detail, I'd simply like to know: how can I tell Proxmox to ignore the disk errors and continue booting, please?
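To illustrate the kind of thing I'm after: if the affected filesystems were listed in /etc/fstab, I'd expect marking them non-critical to be enough, along these lines (the device, mount point and timeout are assumptions on my part; on this setup the OSD filesystems may well be activated by the Ceph udev rules rather than fstab, in which case this wouldn't apply):
Code:
# /etc/fstab -- illustrative entry only; the device and mount point are assumptions
# 'nofail' tells the boot process not to fail if this filesystem can't be mounted,
# and the systemd device timeout stops it waiting forever on a dead disk
/dev/cciss/c0d11p1  /var/lib/ceph/osd/ceph-11  xfs  defaults,nofail,x-systemd.device-timeout=10s  0  0
But if the boot stops earlier than that, e.g. while the controller is still probing the drives, I assume no fstab tweak will help, hence the question.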