LXC Backup fails after Ceph Public Network Change

n7qnm

I just changed my Ceph public network, moving it off my "admin" network onto a separate 2 Gb network.
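
(For reference, this is roughly how I've been checking which addresses the cluster itself has registered after the change; I'm assuming ceph mon dump and ceph osd dump are the right places to look.)

ceph mon dump                    # monitor addresses should now all be on 192.168.71.0/24
ceph osd dump | grep "^osd\."    # per-OSD public/cluster addresses as the monitors see them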

I'm using PBS for backups, running in a VM on my cluster. Now ALL of the backups for my LXCs are failing: the snapshot completes and then the backup complains, "can't map drive....".

Help?
 
Here's ceph -s:

root@pve1:~# ceph -s
  cluster:
    id:     3fb18b1e-2b99-4f57-b29a-104c16ae2cae
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 4 daemons, quorum pve0,pve3,pve2,pve1 (age 6d)
    mgr: pve3(active, since 6d), standbys: pve0, pve2, pve1
    osd: 4 osds: 4 up (since 6d), 4 in (since 2w)

  data:
    pools:   2 pools, 33 pgs
    objects: 46.52k objects, 175 GiB
    usage:   516 GiB used, 3.1 TiB / 3.6 TiB avail
    pgs:     33 active+clean

  io:
    client: 682 B/s rd, 444 KiB/s wr, 0 op/s rd, 40 op/s wr
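
Side note: I'm assuming the "1 daemons have recently crashed" warning can be inspected and cleared with the crash module, something like:

ceph crash ls               # list recent crashes
ceph crash info <crash_id>  # details for a specific one
ceph crash archive-all      # clear the warning once reviewed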

And ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.70.20/24
fsid = 3fb18b1e-2b99-4f57-b29a-104c16ae2cae
mon_allow_pool_delete = true
mon_host = 192.168.71.20 192.168.71.23 192.168.71.22 192.168.71.21
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.71.20/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve0]
host = pve0
mds_standby_for_name = pve

[mds.pve1]
host = pve1
mds_standby_for_name = pve

[mds.pve2]
host = pve2
mds_standby_for_name = pve

[mds.pve3]
host = pve3
mds_standby_for_name = pve

[mon.pve0]
public_addr = 192.168.71.20

[mon.pve1]
public_addr = 192.168.71.21

[mon.pve2]
public_addr = 192.168.71.22

[mon.pve3]
public_addr = 192.168.71.23
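
My understanding (could be wrong) is that the OSDs only bind to a changed public_network after a restart, so the idea would be something like this, one node at a time:

ceph osd set noout                      # avoid rebalancing during the restarts
systemctl restart ceph-osd@0.service    # then ceph-osd@1 ... @3 on their respective nodes
ceph osd unset noout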
 
And here's the output of one of the failing (hung) jobs:
INFO: starting new backup job: vzdump --prune-backups 'keep-daily=14,keep-last=3,keep-monthly=12,keep-weekly=8' --storage PBS --quiet 1 --all 1 --mode snapshot --fleecing 0 --mailnotification always --notes-template '{{guestname}}' --mailto root@n7qnm.net
INFO: skip external VMs: 100, 104, 107, 108, 109, 112, 115
INFO: Starting Backup of VM 105 (lxc)
INFO: Backup started at 2024-10-22 01:00:00
INFO: status = running
ERROR: Backup of VM 105 failed - CT is locked (backup)
INFO: Failed at 2024-10-22 01:00:00
INFO: Starting Backup of VM 110 (lxc)
INFO: Backup started at 2024-10-22 01:00:00
INFO: status = running
INFO: CT Name: FogPi
INFO: including mount point rootfs ('/') in backup
INFO: found old vzdump snapshot (force removal)
Removing snap: 100% complete...done.
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
Creating snap: 10% complete...
Creating snap: 100% complete...done.
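
For what it's worth, this is roughly how I've been cleaning up before retrying (the VMID, pool and image names here are just examples from my setup):

pct unlock 110                       # clear a leftover 'backup' lock
pct delsnapshot 110 vzdump           # drop a stale vzdump snapshot if one was left behind
rbd showmapped                       # see which RBD images are still mapped on this node
rbd -p <pool> snap ls vm-110-disk-0  # list snapshots on the container's volume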
 
I tried unlocking the container. Just now I tried creating a NEW container and saw this in 'dmesg':
[Fri Oct 25 11:00:08 2024] libceph: osd3 (1)192.168.71.22:6801 socket closed (con state V1_BANNER)
[Fri Oct 25 11:00:08 2024] libceph: wrong peer, want (1)192.168.71.21:6801/61181, got (1)192.168.71.21:6801/3276070075
[Fri Oct 25 11:00:08 2024] libceph: osd2 (1)192.168.71.21:6801 wrong peer at address
[Fri Oct 25 11:00:11 2024] libceph: wrong peer, want (1)192.168.75.20:6801/506820, got (1)192.168.75.20:6801/193039
[Fri Oct 25 11:00:11 2024] libceph: osd0 (1)192.168.75.20:6801 wrong peer at address
[Fri Oct 25 11:00:12 2024] libceph: auth protocol 'cephx' mauth authentication failed: -13
[Fri Oct 25 11:00:13 2024] libceph: osd1 (1)192.168.71.23:6801 socket error on write
[Fri Oct 25 11:00:16 2024] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 69 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 618577825 jiffies old.
[Fri Oct 25 11:00:16 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[Fri Oct 25 11:00:22 2024] libceph: auth protocol 'cephx' mauth authentication failed: -13

But I CAN create a VM.
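
If it matters, my understanding is that LXC volumes get mapped through the kernel RBD client (hence the libceph: lines above), while VMs talk to Ceph via librbd inside QEMU, which would explain why creating a VM still works. This is roughly how I'd compare what the cluster advertises against what the kernel is still trying to reach (osd ID 2 is just an example):

ceph osd metadata 2 | grep addr    # front/back addresses the OSD registered with the monitors
dmesg | grep libceph | tail -n 20  # addresses the kernel client is actually dialing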