Can't Start any CTs in cluster after performing latest updates

psionic

After updating my 4-node cluster today, I can no longer start any of my CTs.
Corosync Cluster and Ceph show as healthy.

I created a new unprivileged CT after the updates and it works fine.

I hope there's a way to fix this and not have to rebuild this cluster...

I get an error when running fsck...
pct fsck 4000
fsck from util-linux 2.33.1
fsck.ext2: Unable to resolve 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'
command 'fsck -a -l 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'' failed: exit code 8
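
As a sanity check (sketch only; this just reuses the pool, client ID, and keyring path from the error above), the RBD image itself can be queried directly:
rbd info Ceph-CT/vm-4000-disk-0 --conf /etc/pve/ceph.conf --id admin --keyring /etc/pve/priv/ceph/Ceph-CT.keyring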

ceph status
cluster:
id: e43a583a-2e95-46df-af6b-58574ce1187d
health: HEALTH_OK

services:
mon: 4 daemons, quorum pve11,pve12,pve13,pve14 (age 35m)
mgr: pve11(active, since 94m), standbys: pve14, pve12, pve13
osd: 16 osds: 16 up (since 36m), 16 in (since 4w)

data:
pools: 2 pools, 512 pgs
objects: 508.84k objects, 1.8 TiB
usage: 5.2 TiB used, 24 TiB / 29 TiB avail
pgs: 512 active+clean

io:
client: 5.3 KiB/s wr, 0 op/s rd, 0 op/s wr

Last line in /var/log/ceph/ceph.log (all lines are the same):
2019-11-09 18:17:23.837652 mgr.pve11 (mgr.19724137) 2956 : cluster [DBG] pgmap v2981: 512 pgs: 512 active+clean; 1.8 TiB data, 5.2 TiB used, 24 TiB / 29 TiB avail; 6.3 KiB/s wr, 1 op/s

pvecm status
Quorum information
------------------
Date: Sat Nov 9 18:08:58 2019
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1.1f8
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.1.11 (local)
0x00000002 1 10.10.1.12
0x00000003 1 10.10.1.13
0x00000004 1 10.10.1.14

Package Versions
proxmox-ve: 6.0-2 (running kernel: 5.0.21-4-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-11
pve-kernel-5.0: 6.0-10
pve-kernel-5.0.21-4-pve: 5.0.21-8
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.4-pve1
ceph-fuse: 14.2.4-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-3
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-6
libpve-guest-common-perl: 3.0-2
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-10
pve-docs: 6.0-8
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-4
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-13
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

Update
I found out how to get my CTs running again: I had to remove the four lines I had previously added to convert the CT from privileged to unprivileged.
nano /etc/pve/lxc/[container # from proxmox gui].conf

Before
arch: amd64
cores: 1
hostname: Ion-LMN001
memory: 2560
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.2.3,hwaddr=BE:7F:04:FE:E7:F2,ip=192.168.4.1/16,type=veth
onboot: 1
ostype: ubuntu
rootfs: local-lvm:vm-4001-disk-0,size=20G
swap: 256
unprivileged: 1
lxc.mount.entry: /dev/random dev/random none bind,ro 0 0
lxc.mount.entry: /dev/urandom dev/urandom none bind,ro 0 0
lxc.mount.entry: /dev/random var/spool/postfix/dev/random none bind,ro 0 0
lxc.mount.entry: /dev/urandom var/spool/postfix/dev/urandom none bind,ro 0 0

After
arch: amd64
cores: 1
hostname: Ion-LMN001
memory: 2560
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.2.3,hwaddr=BE:7F:04:FE:E7:F2,ip=192.168.4.1/16,type=veth
onboot: 1
ostype: ubuntu
rootfs: local-lvm:vm-4001-disk-0,size=20G
swap: 256
unprivileged: 1
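
If the same four lines were added to several CT configs, the edit can also be scripted (untested sketch; adjust the CT ID and keep a backup of the config first):
cp /etc/pve/lxc/4001.conf /root/4001.conf.bak   # keep a copy before editing
sed -i '/^lxc\.mount\.entry:/d' /etc/pve/lxc/4001.conf   # drop the manually added lxc.mount.entry lines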

I'm still getting the error when running fsck... (Maybe this is normal?)
pct fsck 4000
fsck from util-linux 2.33.1
fsck.ext2: Unable to resolve 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'
command 'fsck -a -l 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'' failed: exit code 8
 
I hope this issue is already resolved.

fsck.ext2: Unable to resolve 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'
command 'fsck -a -l 'rbd:Ceph-CT/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT.keyring'' failed: exit code 8
Does the keyring under /etc/pve/priv/ceph/ exist?
 
Yes, the keyring exists, but I'm still getting the same message. Still wondering if this is a false error?
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
 
/etc/pve/priv/ceph/Ceph-CT.keyring
To be sure, does this exact keyring file exist? And is the content the same, as checked with ceph auth get client.admin | diff -s - /etc/pve/priv/ceph/Ceph-CT.keyring?
 
I changed the pool to Ceph-CT-VM.
ls /etc/pve/priv/ceph
Ceph-CT-VM.keyring test.keyring

ceph auth get client.admin | diff -s - /etc/pve/priv/ceph/Ceph-CT-VM.keyring
exported keyring for client.admin
Files - and /etc/pve/priv/ceph/Ceph-CT-VM.keyring are identical

Still getting error:
pct fsck 4000
fsck from util-linux 2.33.1
fsck.ext2: Unable to resolve 'rbd:Ceph-CT-VM/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT-VM.keyring'
command 'fsck -a -l 'rbd:Ceph-CT-VM/vm-4000-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Ceph-CT-VM.keyring'' failed: exit code 8

Any ideas?
 
I could re-create this on my test system. I have sent patches to the mailing list; they should be packaged soon.

In the meantime, as a workaround, you can map the corresponding RBD volume (rbd map) and run fsck by hand.
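
For example (rough sketch only; the device name is whatever rbd map actually prints, shown here as /dev/rbd0, using the pool/image and keyring from the error above):
rbd map Ceph-CT-VM/vm-4000-disk-0 --id admin --keyring /etc/pve/priv/ceph/Ceph-CT-VM.keyring   # prints the mapped device, e.g. /dev/rbd0
fsck -a /dev/rbd0    # run fsck directly against the mapped block device
rbd unmap /dev/rbd0  # unmap again before starting the CT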
 
Alwin, thanks for your help with this...
 
