Unable to start VMs on one node

IH Tech

We have worked for hours on this and have not been able to solve it. Can anyone assist?

One of our nodes fails to start its Ceph OSDs after a reboot. This in turn prevents any VM on it from starting. This is a live server.

Mar 28 22:01:22 pm3 systemd[1]: Unit ceph.service entered failed state.
Mar 28 22:09:00 pm3 systemd[1]: Unit ceph-mon.2.1459218879.795083638.service entered failed state.
Mar 28 22:10:49 pm3 console-setup[1642]: failed.
Mar 28 22:10:49 pm3 kernel: [ 2.605140] ata6.00: READ LOG DMA EXT failed, trying unqueued
Mar 28 22:10:49 pm3 kernel: [ 2.605167] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 kernel: [ 2.605456] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 pmxcfs[1795]: [quorum] crit: quorum_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [confdb] crit: cmap_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [dcdb] crit: cpg_initialize failed: 2
Mar 28 22:10:49 pm3 pmxcfs[1795]: [status] crit: cpg_initialize failed: 2
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:10:49 pm3 pvecm[1798]: ipcc_send_rec failed: Connection refused
Mar 28 22:11:20 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5 --keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5 3.64 host=pm3 root=default'
Mar 28 22:11:20 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.5']' returned non-zero exit status 1
Mar 28 22:11:50 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.7 --keyring=/var/lib/ceph/osd/ceph-7/keyring osd crush create-or-move -- 7 3.64 host=pm3 root=default'
Mar 28 22:11:50 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.7']' returned non-zero exit status 1
Mar 28 22:12:21 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.9 --keyring=/var/lib/ceph/osd/ceph-9/keyring osd crush create-or-move -- 9 3.64 host=pm3 root=default'
Mar 28 22:12:21 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.9']' returned non-zero exit status 1
Mar 28 22:12:51 pm3 ceph[1891]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.11 --keyring=/var/lib/ceph/osd/ceph-11/keyring osd crush create-or-move -- 11 3.64 host=pm3 root=default'
Mar 28 22:12:51 pm3 ceph[1891]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.11']' returned non-zero exit status 1
Mar 28 22:12:51 pm3 ceph[1891]: ceph-disk: Error: One or more partitions failed to activate
 
Hi,

Probably stating the obvious, but based on the information provided I would focus on these errors for a bit:

Mar 28 22:10:49 pm3 kernel: [ 2.605140] ata6.00: READ LOG DMA EXT failed, trying unqueued
Mar 28 22:10:49 pm3 kernel: [ 2.605167] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
Mar 28 22:10:49 pm3 kernel: [ 2.605456] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
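
Those messages mean the drive on port ata6 stalled when the kernel probed its NCQ log pages at boot. That can be harmless on its own, but alongside OSDs timing out it is worth identifying which physical disk sits on that port. On recent kernels the sysfs path of a block device includes the ATA port name, so something like this should map it (it will point at a /dev/sdX device):

# map ata6 to a block device; the sysfs symlink contains the port name
ls -l /sys/block | grep ata6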

You wouldn't happen to be using Samsung SSDs?
Does the kernel see all disks in the server?
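
To answer that second question yourself, compare what the kernel enumerates against the disks physically installed, and pull SMART data from the suspect drive (smartctl comes from the smartmontools package; /dev/sdX below is a placeholder for whatever the grep above turned up):

# list every disk the kernel sees, with model and serial
lsblk -o NAME,SIZE,MODEL,SERIAL
# health, error counters and firmware revision of the suspect drive
smartctl -a /dev/sdX

If it does turn out to be one of the Samsung models with known NCQ firmware problems, a common workaround is to disable NCQ on just that port with the libata.force kernel parameter. Roughly, and assuming the port really is ata6 as in your log:

# in /etc/default/grub, then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=6.00:noncq"

Treat that as a sketch rather than a fix: if SMART shows the disk failing, no kernel parameter will save the OSDs on it.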
 
