Unable to start VMs on one node

IH Tech

We have worked for hours on this and have not been able to solve it. Can anyone assist?

One of our nodes fails to start its Ceph OSDs after a reboot. This in turn prevents any VM on it from starting. This is a live server.

Mar 28 22:01:22 pm3 systemd[1]: Unit ceph.service entered failed state.
Mar 28 22:09:00 pm3 systemd[1]: Unit ceph-mon.2.1459218879.795083638.service entered failed state.
Mar 28 22:10:49 pm3 console-setup[1642]: failed.
Mar 28 22:10:49 pm3 kernel: [ 2.605140] ata6.00: READ LOG DMA EXT failed, trying unqueued
Mar 28 22:10:49 pm3 kernel: [ 2.605167] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 kernel: [ 2.605456] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 pmxcfs[1795]: [quorum] crit: quorum_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [confdb] crit: cmap_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [dcdb] crit: cpg_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [status] crit: cpg_initialize failed: 2
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:11:20 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5 --keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5 3.64 host=pm3 root=default'
Mar 28 22:11:20 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.5']' returned non-zero exit status 1
Mar 28 22:11:50 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.7 --keyring=/var/lib/ceph/osd/ceph-7/keyring osd crush create-or-move -- 7 3.64 host=pm3 root=default'
Mar 28 22:11:50 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.7']' returned non-zero exit status 1
Mar 28 22:12:21 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.9 --keyring=/var/lib/ceph/osd/ceph-9/keyring osd crush create-or-move -- 9 3.64 host=pm3 root=default'
Mar 28 22:12:21 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.9']' returned non-zero exit status 1
Mar 28 22:12:51 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.11 --keyring=/var/lib/ceph/osd/ceph-11/keyring osd crush create-or-move -- 11 3.64 host=pm3 root=default'
Mar 28 22:12:51 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.11']' returned non-zero exit status 1
Mar 28 22:12:51 pm3 ceph[1891]: ceph-disk: Error: One or more partitions failed to activate
 
Hi,

Probably stating the obvious, but based on the information provided I would focus on these errors for a bit:

Mar 28 22:10:49 pm3 kernel: [ 2.605140] ata6.00: READ LOG DMA EXT failed, trying unqueued
Mar 28 22:10:49 pm3 kernel: [ 2.605167] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 kernel: [ 2.605456] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
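
Those messages mean the drive on port ata6 stalled when the kernel probed its NCQ log pages at boot. That can be harmless on its own, but alongside OSDs timing out it is worth identifying which physical disk sits on that port. On recent kernels the sysfs path of a block device includes the ATA port name, so something like this should map it (it will point at a /dev/sdX device):

# map ata6 to a block device; the sysfs symlink contains the port name
ls -l /sys/block | grep ata6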

You wouldn't happen to be using Samsung SSDs?
Does the kernel see all disks in the server?
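
To answer that second question yourself, compare what the kernel enumerates against the disks physically installed, and pull SMART data from the suspect drive (smartctl comes from the smartmontools package; /dev/sdX below is a placeholder for whatever the grep above turned up):

# list every disk the kernel sees, with model and serial
lsblk -o NAME,SIZE,MODEL,SERIAL
# health, error counters and firmware revision of the suspect drive
smartctl -a /dev/sdX

If it does turn out to be one of the Samsung models with known NCQ firmware problems, a common workaround is to disable NCQ on just that port with the libata.force kernel parameter. Roughly, and assuming the port really is ata6 as in your log:

# in /etc/default/grub, then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=6.00:noncq"

Treat that as a sketch rather than a fix: if SMART shows the disk failing, no kernel parameter will save the OSDs on it.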
 
