Can't Start any CTs in cluster after performing latest updates

psionic

After updating my 4-node cluster today, I can no longer start any of my CTs.
Corosync Cluster and Ceph show as healthy.

I created a new unprivileged CT after the updates and it works fine.

I hope there's a way to fix this and not have to rebuild this cluster...

I get an error when running fsck...
pct fsck 4000
fsck from util-linux 2.33.1
fsck.ext2: Unable to resolve 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'
command 'fsck -a -l 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'' failed: exit code 8
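
As a sanity check (sketch only; this just reuses the pool, client ID, and keyring path from the error above), the RBD image itself can be queried directly:
rbd info Ceph-CT/vm-4000-disk-0 --conf /etc/pve/ceph.conf --id admin --keyring /etc/pve/priv/ceph/Ceph-CT.keyring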

ceph status
cluster:
id: e43a583a-2e95-46df-af6b-58574ce1187d
health: HEALTH_OK

services:
mon: 4 daemons, quorum pve11,pve12,pve13,pve14 (age 35m)
mgr: pve11(active, since 94m), standbys: pve14, pve12, pve13
osd: 16 osds: 16 up (since 36m), 16 in (since 4w)

data:
pools: 2 pools, 512 pgs
objects: 508.84k objects, 1.8 TiB
usage: 5.2 TiB used, 24 TiB / 29 TiB avail
pgs: 512 active+clean

io:
client: 5.3 KiB/s wr, 0 op/s rd, 0 op/s wr

Last line in /var/log/ceph/ceph.log (all lines are the same):
2019-11-09 18:17:23.837652 mgr.pve11 (mgr.19724137) 2956 : cluster [DBG] pgmap v2981: 512 pgs: 512 active+clean; 1.8 TiB data, 5.2 TiB used, 24 TiB / 29 TiB avail; 6.3 KiB/s wr, 1 op/s

pvecm status
Quorum information
------------------
Date: Sat Nov 9 18:08:58 2019
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1.1f8
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.1.11 (local)
0x00000002 1 10.10.1.12
0x00000003 1 10.10.1.13
0x00000004 1 10.10.1.14

Package Versions
proxmox-ve: 6.0-2 (running kernel: 5.0.21-4-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-11
pve-kernel-5.0: 6.0-10
pve-kernel-5.0.21-4-pve: 5.0.21-8
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-3
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-6
libpve-guest-common-perl: 3.0-2
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-10
pve-docs: 6.0-8
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-4
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-13
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

Update
I found out how to get my CTs running again: I had to remove the four lines I had previously added to convert the CT from privileged to unprivileged.
nano /etc/pve/lxc/[container # from proxmox gui].conf

Before
arch: amd64
cores: 1
hostname: Ion-LMN001
memory: 2560
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.2.3,hwaddr=BE:7F:04:FE:E7:F2,ip=192.168.4.1/16,type=veth
onboot: 1
ostype: ubuntu
rootfs: local-lvm:vm-4001-disk-0,size=20G
swap: 256
unprivileged: 1
lxc.mount.entry: /dev/random dev/random none bind,ro 0 0
lxc.mount.entry: /dev/urandom dev/urandom none bind,ro 0 0
lxc.mount.entry: /dev/random var/spool/postfix/dev/random none bind,ro 0 0
lxc.mount.entry: /dev/urandom var/spool/postfix/dev/urandom none bind,ro 0 0

After
arch: amd64
cores: 1
hostname: Ion-LMN001
memory: 2560
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.2.3,hwaddr=BE:7F:04:FE:E7:F2,ip=192.168.4.1/16,type=veth
onboot: 1
ostype: ubuntu
rootfs: local-lvm:vm-4001-disk-0,size=20G
swap: 256
unprivileged: 1
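
If the same four lines were added to several CT configs, the edit can also be scripted (untested sketch; adjust the CT ID and keep a backup of the config first):
cp /etc/pve/lxc/4001.conf /root/4001.conf.bak   # keep a copy before editing
sed -i '/^lxc\.mount\.entry:/d' /etc/pve/lxc/4001.conf   # drop the manually added lxc.mount.entry lines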

I'm still getting the error when running fsck... (Maybe this is normal?)
pct fsck 4000
fsck from util-linux 2.33.1
fsck.ext2: Unable to resolve 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'
command 'fsck -a -l 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'' failed: exit code 8
 
I hope this issue is already resolved.

fsck.ext2: Unable to resolve 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'
command 'fsck -a -l 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'' failed: exit code 8
Does the keyring under /etc/pve/priv/ceph/ exist?
 
Yes, the keyring exists, but I'm still getting the same message. Still wondering if this is a false error?
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
 
/etc/pve/priv/ceph/Ceph-CT.keyring
To be sure, does this exact keyring file exist? And is the content the same, as checked with ceph auth get client.admin | diff -s - /etc/pve/priv/ceph/Ceph-CT.keyring?
 
I changed the pool to Ceph-CT-VM.
ls /etc/pve/priv/ceph
Ceph-CT-VM.keyring test.keyring

ceph auth get client.admin | diff -s - /etc/pve/priv/ceph/Ceph-CT-VM.keyring
exported keyring for client.admin
Files - and /etc/pve/priv/ceph/Ceph-CT-VM.keyring are identical

Still getting error:
pct fsck 4000
fsck from util-linux 2.33.1
fsck.ext2: Unable to resolve 'rbd:Ceph-CT-VM/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT-VM.keyring'
command 'fsck -a -l 'rbd:Ceph-CT-VM/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT-VM.keyring'' failed: exit code 8

Any ideas?
 
I could re-create this on my test system. I have sent patches to the mailing list; they should be packaged soon.

In the meantime, as a workaround, you can map the corresponding RBD volume (rbd map) and run fsck by hand.
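
For example (rough sketch only; the device name is whatever rbd map actually prints, shown here as /dev/rbd0, using the pool/image and keyring from the error above):
rbd map Ceph-CT-VM/vm-4000-disk-0 --id admin --keyring /etc/pve/priv/ceph/Ceph-CT-VM.keyring   # prints the mapped device, e.g. /dev/rbd0
fsck -a /dev/rbd0    # run fsck directly against the mapped block device
rbd unmap /dev/rbd0  # unmap again before starting the CT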
 
Alwin, thanks for your help with this...
 
