Unable to start VMs on one node

IH Tech

New Member
Mar 1, 2016
We have worked for hours on this and have not been able to solve it. Can anyone assist?

After a reboot, one of our nodes fails to start its Ceph OSD disks, which in turn prevents any VM from starting. This is a live server.

Mar 28 22:01:22 pm3 systemd[1]: Unit ceph.service entered failed state.
Mar 28 22:09:00 pm3 systemd[1]: Unit ceph-mon.2.1459218879.795083638.service entered failed state.
Mar 28 22:10:49 pm3 console-setup[1642]: failed.
Mar 28 22:10:49 pm3 kernel: [ 2.605140] ata6.00: READ LOG DMA EXT failed, trying unqueued
Mar 28 22:10:49 pm3 kernel: [ 2.605167] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 kernel: [ 2.605456] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 pmxcfs[1795]: [quorum] crit: quorum_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [confdb] crit: cmap_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [dcdb] crit: cpg_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [status] crit: cpg_initialize failed: 2
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:11:20 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5 --keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5 3.64 host=pm3 root=default'
Mar 28 22:11:20 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.5']' returned non-zero exit status 1
Mar 28 22:11:50 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.7 --keyring=/var/lib/ceph/osd/ceph-7/keyring osd crush create-or-move -- 7 3.64 host=pm3 root=default'
Mar 28 22:11:50 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.7']' returned non-zero exit status 1
Mar 28 22:12:21 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.9 --keyring=/var/lib/ceph/osd/ceph-9/keyring osd crush create-or-move -- 9 3.64 host=pm3 root=default'
Mar 28 22:12:21 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.9']' returned non-zero exit status 1
Mar 28 22:12:51 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.11 --keyring=/var/lib/ceph/osd/ceph-11/keyring osd crush create-or-move -- 11 3.64 host=pm3 root=default'
Mar 28 22:12:51 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.11']' returned non-zero exit status 1
Mar 28 22:12:51 pm3 ceph[1891]: ceph-disk: Error: One or more partitions failed to activate
 
Hi,

Probably stating the obvious, but based on the information provided I would focus on these errors for a bit:

Mar 28 22:10:49 pm3 kernel: [ 2.605140] ata6.00: READ LOG DMA EXT failed, trying unqueued
Mar 28 22:10:49 pm3 kernel: [ 2.605167] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 kernel: [ 2.605456] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
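
Those lines usually point at the SATA link or the drive firmware rather than at Ceph itself. As a rough sketch (assuming the drive behind ata6 shows up as a normal /dev/sdX device and smartmontools/hdparm are installed), you could check which disk sits on ata6 and whether NCQ is in play:

# see which /dev/sdX the kernel attached to ata6
dmesg | grep -i 'ata6'

# identify the drive model and firmware on that device (replace sdX)
smartctl -i /dev/sdX

# check whether NCQ is advertised/enabled on the drive
hdparm -I /dev/sdX | grep -i -A1 queue

If it turns out to be a drive with known NCQ firmware issues, booting once with NCQ disabled (e.g. the libata.force=noncq kernel parameter) is a common test, but I would confirm the hardware side first.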

You wouldn't happen to be using Samsung SSDs?
Does the kernel see all disks in the server?
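
Something like the following would answer that quickly (just a sketch using standard tools, adjust to your layout):

# list every block device the kernel currently sees, with size and model
lsblk -o NAME,SIZE,MODEL,SERIAL,MOUNTPOINT

# list the Ceph data/journal partitions ceph-disk knows about on this node
ceph-disk list

# compare against what the cluster expects from this host
# (run from a node that can still reach the monitors)
ceph osd tree

If one of the OSD disks is missing from the lsblk output, that would explain why ceph-disk reports "One or more partitions failed to activate".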