Proxmox 7.1-12 > 7.2-7 Upgrade from Ceph 16.2.7 to Ceph 16.2.9 Snapshot Problems

fstrankowski

Good Morning everyone!

Background: We had been running without errors for weeks prior to yesterday's upgrade to 7.2-7. Since the upgrade from 7.1-12 to 7.2-7, which included upgrading Ceph to 16.2.9, we can no longer snapshot our LXC containers while they are running. This especially affects our database clusters (MariaDB). Backups in suspend mode show no problems.


Proxmox is unable to create the RBD snapshot on the underlying Ceph storage. Some containers still work, but as soon as we start to snapshot a write-intensive LXC container such as our database, the snapshot hangs. We have to forcefully reboot the node to resolve the issue.

Has anyone had similar issues? These problems started after upgrading from Proxmox 7.1-12 to 7.2-7, with Ceph going from 16.2.7 to 16.2.9.

proxmox-ve: 7.2-1 (running kernel: 5.15.39-4-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-9
pve-kernel-helper: 7.2-9
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.4-7
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-5-pve: 5.13.19-13
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.143-1-pve: 5.4.143-1
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-7
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-11
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 
We'd like to share our findings regarding the described problem with backups and LXC snapshots in the latest 7.2-X version of Proxmox in combination with Ceph. After countless hours of debugging, we are confident we have found the root cause of the reported issue.

Summary: After upgrading to the latest Proxmox 7.2-7, we had problems backing up running LXC containers with multiple mountpoints in our Proxmox-Ceph cluster. Once a backup started, the kernel panicked (kernel tainted) and hung, and both the container and the host crashed.

Debug: After a shitton of debugging, we found that the problem resides in the fsfreeze_mountpoint function (PVE/LXC/Config.pm), which freezes the RBD mountpoints before a snapshot can happen. The problem does not occur for the root filesystem of an LXC container, but if a container uses multiple mountpoints (mp0, mp1, mpX), the freeze for the additional mountpoints will fail.

You can reproduce it on the commandline without touching Proxmox:

Code:
# find the container's init PID first, e.g.: lxc-info -pH -n <CTID>
fsfreeze -f /proc/<pid-of-lxc>/root                      # OK
fsfreeze -f /proc/<pid-of-lxc>/root/<anothermountpoint>  # fails / hangs

This can be reproduced on all 5.15 PVE kernels, including the most recent one (5.15.39-4-pve). A patch has been committed to the official kernel.org 5.15 tree (see here) and has made it into kernels 5.15.64 and 5.15.65 (and upwards) from kernel.org. We are now waiting for the Proxmox team to backport said patch into the latest enterprise kernel, 5.15.39-4-pve, so it becomes 5.15.39-5-pve. And just by the way, all 5.15.x Proxmox community kernels are affected as well (the most recent being 5.15.39-4-pve and 5.15.53-1-pve).
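To quickly check whether a host is on a potentially affected kernel series, a small classifier like the sketch below can help. This is only a rough series check under the assumptions above (any 5.15.x PVE build predates the upstream 5.15.64 fix; 5.13.x and 5.4.x are not affected by this particular bug):

```shell
#!/usr/bin/env bash
# Sketch: classify the running kernel by series with respect to the
# fsfreeze bug described above. This only looks at the version prefix;
# once a fixed 5.15 build ships, it would be a false positive.
classify_kernel() {
  case "$1" in
    5.15.*)       echo "potentially affected" ;;
    5.13.*|5.4.*) echo "not affected" ;;
    *)            echo "unknown" ;;
  esac
}

echo "$(uname -r): $(classify_kernel "$(uname -r)")"
```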

Quick Fix: For the time being, a downgrade to a 5.13 kernel fixes the problem and lets you snapshot LXC mountpoints without hassle. We hope this helps anyone having the same problem.
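If an older 5.13 kernel is still installed, it can be pinned with proxmox-boot-tool before rebooting. The version string below is an assumption taken from the package list above; check `proxmox-boot-tool kernel list` for what is actually installed on your host. The sketch prints the commands by default (DRY_RUN=1) instead of running them, since pinning only takes effect after a reboot:

```shell
#!/usr/bin/env bash
# Sketch: pin a previously installed 5.13 kernel. "5.13.19-6-pve" is an
# assumption based on the package list above; verify with:
#   proxmox-boot-tool kernel list
set -u
PIN="5.13.19-6-pve"

# Print commands instead of executing them unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run proxmox-boot-tool kernel pin "$PIN"   # boot this kernel on every reboot
run reboot                                # the pin takes effect after reboot
```

Run with `DRY_RUN=0` once the printed commands look right; `proxmox-boot-tool kernel unpin` reverts to the default kernel later.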
 
