LXC Backup fails after Ceph Public Network Change

n7qnm

I just changed my Ceph public network, moving it off my "admin" network onto a separate 2 Gb network.
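
(For reference, this is roughly how I've been checking which addresses the cluster itself has registered after the change; I'm assuming ceph mon dump and ceph osd dump are the right places to look.)

ceph mon dump                    # monitor addresses should now all be on 192.168.71.0/24
ceph osd dump | grep "^osd\."    # per-OSD public/cluster addresses as the monitors see them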

I'm using PBS for backups, running in a VM on my cluster. Now ALL of the backups for my LXCs are failing: the snapshot completes and then the backup complains, "can't map drive....".

Help?
 
Here's ceph -s:

root@pve1:~# ceph -s
  cluster:
    id:     3fb18b1e-2b99-4f57-b29a-104c16ae2cae
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 4 daemons, quorum pve0,pve3,pve2,pve1 (age 6d)
    mgr: pve3(active, since 6d), standbys: pve0, pve2, pve1
    osd: 4 osds: 4 up (since 6d), 4 in (since 2w)

  data:
    pools:   2 pools, 33 pgs
    objects: 46.52k objects, 175 GiB
    usage:   516 GiB used, 3.1 TiB / 3.6 TiB avail
    pgs:     33 active+clean

  io:
    client: 682 B/s rd, 444 KiB/s wr, 0 op/s rd, 40 op/s wr
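
Side note: I'm assuming the "1 daemons have recently crashed" warning can be inspected and cleared with the crash module, something like:

ceph crash ls               # list recent crashes
ceph crash info <crash_id>  # details for a specific one
ceph crash archive-all      # clear the warning once reviewed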

And ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.70.20/24
fsid = 3fb18b1e-2b99-4f57-b29a-104c16ae2cae
mon_allow_pool_delete = true
mon_host = 192.168.71.20 192.168.71.23 192.168.71.22 192.168.71.21
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.71.20/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve0]
host = pve0
mds_standby_for_name = pve

[mds.pve1]
host = pve1
mds_standby_for_name = pve

[mds.pve2]
host = pve2
mds_standby_for_name = pve

[mds.pve3]
host = pve3
mds_standby_for_name = pve

[mon.pve0]
public_addr = 192.168.71.20

[mon.pve1]
public_addr = 192.168.71.21

[mon.pve2]
public_addr = 192.168.71.22

[mon.pve3]
public_addr = 192.168.71.23
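
My understanding (could be wrong) is that the OSDs only bind to a changed public_network after a restart, so the idea would be something like this, one node at a time:

ceph osd set noout                      # avoid rebalancing during the restarts
systemctl restart ceph-osd@0.service    # then ceph-osd@1 ... @3 on their respective nodes
ceph osd unset noout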
 
And here's the output of one of the failing (hung) jobs:
INFO: starting new backup job: vzdump --prune-backups 'keep-daily=14,keep-last=3,keep-monthly=12,keep-weekly=8' --storage PBS --quiet 1 --all 1 --mode snapshot --fleecing 0 --mailnotification always --notes-template '{{guestname}}' --mailto root@n7qnm.net
INFO: skip external VMs: 100, 104, 107, 108, 109, 112, 115
INFO: Starting Backup of VM 105 (lxc)
INFO: Backup started at 2024-10-22 01:00:00
INFO: status = running
ERROR: Backup of VM 105 failed - CT is locked (backup)
INFO: Failed at 2024-10-22 01:00:00
INFO: Starting Backup of VM 110 (lxc)
INFO: Backup started at 2024-10-22 01:00:00
INFO: status = running
INFO: CT Name: FogPi
INFO: including mount point rootfs ('/') in backup
INFO: found old vzdump snapshot (force removal)
Removing snap: 100% complete...done.
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
Creating snap: 10% complete...
Creating snap: 100% complete...done.
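
For what it's worth, this is roughly how I've been cleaning up before retrying (the VMID, pool and image names here are just examples from my setup):

pct unlock 110                       # clear a leftover 'backup' lock
pct delsnapshot 110 vzdump           # drop a stale vzdump snapshot if one was left behind
rbd showmapped                       # see which RBD images are still mapped on this node
rbd -p <pool> snap ls vm-110-disk-0  # list snapshots on the container's volume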
 
I tried unlocking the container. Just now I tried creating a NEW container and saw this in 'dmesg':
[Fri Oct 25 11:00:08 2024] libceph: osd3 (1)192.168.71.22:6801 socket closed (con state V1_BANNER)
[Fri Oct 25 11:00:08 2024] libceph: wrong peer, want (1)192.168.71.21:6801/61181, got (1)192.168.71.21:6801/3276070075
[Fri Oct 25 11:00:08 2024] libceph: osd2 (1)192.168.71.21:6801 wrong peer at address
[Fri Oct 25 11:00:11 2024] libceph: wrong peer, want (1)192.168.75.20:6801/506820, got (1)192.168.75.20:6801/193039
[Fri Oct 25 11:00:11 2024] libceph: osd0 (1)192.168.75.20:6801 wrong peer at address
[Fri Oct 25 11:00:12 2024] libceph: auth protocol 'cephx' mauth authentication failed: -13
[Fri Oct 25 11:00:13 2024] libceph: osd1 (1)192.168.71.23:6801 socket error on write
[Fri Oct 25 11:00:16 2024] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 69 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 618577825 jiffies old.
[Fri Oct 25 11:00:16 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[Fri Oct 25 11:00:22 2024] libceph: auth protocol 'cephx' mauth authentication failed: -13

But I CAN create a VM.
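
If it matters, my understanding is that LXC volumes get mapped through the kernel RBD client (hence the libceph: lines above), while VMs talk to Ceph via librbd inside QEMU, which would explain why creating a VM still works. This is roughly how I'd compare what the cluster advertises against what the kernel is still trying to reach (osd ID 2 is just an example):

ceph osd metadata 2 | grep addr    # front/back addresses the OSD registered with the monitors
dmesg | grep libceph | tail -n 20  # addresses the kernel client is actually dialing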