After daily backup fails, node is not usable

CvH

Hi,
we have a 4-node cluster with Ceph, configured and running for years.
Recently, starting around 6.2-6, our SMB backups began failing regularly (ZSTD/snapshot mode). The same happens with 6.2-12.
It is likely due to a network failure, but we have no idea why this happens; nothing obvious was found.

Some dmesg output from around that point:
Code:
[32528.438433] CIFS VFS: Close unmatched open
[32528.581811] CIFS VFS: No writable handle in writepages rc=-9
[32528.582850] CIFS VFS: No writable handle in writepages rc=-9

Even worse, after the backup is stuck, the LXC/VM is locked and does not respond to anything.
It gets even more problematic :) as soon as I click "abort the backup" in the Proxmox admin GUI: the whole node becomes unresponsive after that.
There is no other option besides rebooting the node.
Starting/stopping/rebooting any LXC/VM via console is no longer possible (GUI/shell/local shell), and even the ones still running normally get slower and become unresponsive over time.

I know such things are likely not easy to track down, and I can't reproduce it at the moment; I have to wait for a failing backup.
Are there any logs that could help find the underlying issue? I could collect them at the next failing backup.
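For the next failing backup I could, for example, grab something along these lines (the timestamps are just placeholders for the actual backup window):

Code:
# journal around the backup window (adjust the timestamps to the actual run)
journalctl --since "2020-10-20 01:00" --until "2020-10-20 02:00" > /root/backup-window.log
# kernel log with human-readable timestamps
dmesg -T > /root/dmesg.log
# process list including the state column, to spot tasks stuck in uninterruptible sleep (D)
ps faxl > /root/ps-state.log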
 
The backup failed again today; below are the last entries in dmesg.
10.24.12.34 is our Synology backup NAS.

Code:
[Oct20 01:16] CIFS VFS: \\10.24.12.34 Cancelling wait for mid 3907741 cmd: 5
[  +0.000008] CIFS VFS: \\10.24.12.34 Cancelling wait for mid 3907742 cmd: 16
[  +2.378669] CIFS VFS: \\10.24.12.34 Cancelling wait for mid 3907743 cmd: 5
[  +0.000007] CIFS VFS: \\10.24.12.34 Cancelling wait for mid 3907744 cmd: 16
[  +0.275915] CIFS VFS: Close unmatched open
[  +0.000308] CIFS VFS: Close unmatched open
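When that happens I could also check whether the CIFS session itself is still alive; a rough sketch (assuming /proc/fs/cifs/DebugData is available on this kernel and smbclient is installed, with our backup user from the storage config):

Code:
# state of the kernel CIFS sessions/mounts
cat /proc/fs/cifs/DebugData
# test whether the NAS still answers SMB requests at all
smbclient -L //10.24.12.34 -U proxmox-backup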

Backup job (3 other LXC containers had finished successfully before):
Code:
INFO: Starting Backup of VM 123 (lxc)
INFO: Backup started at 2020-10-20 01:19:44
INFO: status = running
INFO: CT Name: someserver
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'

After stopping the backup job for 123, the next LXC started its backup (in the meantime I had disabled the backup in the datacenter settings) and that one was not stoppable either; the node went down and a reboot was necessary.
 
After switching to a different backup server with completely different config/hardware, and forcing SMB3 only, it still happens.
Backups for several LXC/VM guests had already finished before it hung:

Code:
INFO: Starting Backup of VM 148 (lxc)
INFO: Backup started at 2020-11-04 01:19:54
INFO: status = running
INFO: CT Name: database
INFO: including mount point rootfs ('/') in backup
INFO: found old vzdump snapshot (force removal)
2020-11-04 01:19:54.512 7fca2cff9700 -1 librbd::object_map::InvalidateRequest: 0x7fca2800ace0 should_complete: r=0
Removing snap: 100% complete...done.
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd17
INFO: creating vzdump archive '/mnt/pve/neuer_backup_server/dump/vzdump-lxc-148-2020_11_04-01_19_54.tar.zst'
INFO: Total bytes written: 88397977600 (83GiB, 213MiB/s)
INFO: archive file size: 12.39GB
INFO: remove vzdump snapshot

The current time is 08:30, so it has been in that state for ~7 hours.


Now I can't reboot/start/stop anything without crashing the whole node.
Any idea? If logs/dmesg etc. are needed, just ask.


After some deeper forum searching: is it possible that Proxmox has a bug when using zstd for backups?
 
zstd was a red herring; the same happens with every compression level, even on the current latest Proxmox 6.3.x.
 
Please provide some more information (example commands below):
  • Storage config (/etc/pve/storage.cfg)
  • pveversion -v output
  • VM config
  • Syslog/Journal covering the complete task and some time before/after
  • mount output
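For example, roughly like this (148 is just a placeholder for the affected guest ID, and the journal time range should cover the whole task):

Code:
cat /etc/pve/storage.cfg
pveversion -v
pct config 148          # or 'qm config <vmid>' for a VM
journalctl --since "2020-11-04 00:00" --until "2020-11-04 09:00"
mount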
 
Code:
:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso


rbd: ceph
        content rootdir,images
        krbd 1
        pool ceph


cifs: backup_server
        path /mnt/pve/backup_server
        server 10.24.123.123
        share Proxmox
        content iso,vztmpl,snippets,backup
        domain DOMAIN
        prune-backups keep-last=16
        username proxmox-backup


mount:
Code:
:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=65865292k,nr_inodes=16466323,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=13184608k,mode=755)
/dev/mapper/pve-root on / type ext4 (rw,relatime,errors=remount-ro)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=43942)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
tmpfs on /var/lib/ceph/osd/ceph-10 type tmpfs (rw,relatime)
/dev/nvme0n1p1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme2n1p1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
//10.24.123.123/Proxmox on /mnt/pve/backup_server type cifs (rw,relatime,vers=3.0,cache=strict,username=proxmox-backup,domain=DOMAIN,uid=0,noforceuid,gid=0,noforcegid,addr=10.24.123.123,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=13184604k,mode=700)

Code:
:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ifupdown2: residual config
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1


A random LXC that sometimes works and sometimes fails:
Code:
cat /etc/pve/lxc/148.conf
arch: amd64
cores: 12
hostname: datenbank
memory: 68832
net0: name=eth0,bridge=vmbr1,firewall=1,gw=10.24.123.1,hwaddr=00:00:00:00:00:F4,ip=10.24.123.222/22,tag=30,type=veth
onboot: 1
ostype: ubuntu
rootfs: ceph:vm-148-disk-0,size=110G
startup: order=1,up=20
swap: 2048
unprivileged: 1
lxc.prlimit.nofile: 30000


Currently I have just this log; I would need to run the backup -> crash the infrastructure to get these logs again, and I need some time to find a proper timeframe for that. Maybe there is already enough here to ring some bells.

Code:
INFO: Starting Backup of VM 148 (lxc)
INFO: Backup started at 2020-11-04 01:19:54
INFO: status = running
INFO: CT Name: database
INFO: including mount point rootfs ('/') in backup
INFO: found old vzdump snapshot (force removal)
2020-11-04 01:19:54.512 7fca2cff9700 -1 librbd::object_map::InvalidateRequest: 0x7fca2800ace0 should_complete: r=0
Removing snap: 100% complete...done.
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd17
INFO: creating vzdump archive '/mnt/pve/neuer_backup_server/dump/vzdump-lxc-148-2020_11_04-01_19_54.tar.zst'
INFO: Total bytes written: 88397977600 (83GiB, 213MiB/s)
INFO: archive file size: 12.39GB
INFO: remove vzdump snapshot
7 hours later it is still stuck at this point.


What I have tested so far that made no difference:
- backup server completely exchanged (1x Synology, 1x self-built Samba server)
- tested SMB 2 and 3 by forcing them
- running the scheduled backup jobs on just a single node instead of in parallel on all nodes
- changed the time when they run
- changed the backup compression
- both privileged and unprivileged containers as well as VMs are failing
- our backup cluster, using the same backup infrastructure (mount options etc. are identical), works without problems

The same setup worked perfectly before we upgraded from 6.1 to 6.2, with no additional changes. Initially we were not aware that this behavior came from the backup itself. Single LXC/VM backups work without problems; as soon as they run as part of a backup job, the trouble begins.
As already said, the backup runs fine for some machines and then never ends for one of them. As soon as I click cancel, the Proxmox GUI goes south. While the machine is running the endless backup, the LXC/VM itself is not accessible and looks crashed.
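If it helps, the next time it hangs I could also check which processes are blocked before clicking cancel; a rough sketch (assuming SysRq is allowed on the node):

Code:
# processes stuck in uninterruptible sleep (state D), e.g. a hung tar/zstd writing to the CIFS mount
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# ask the kernel to dump the stacks of blocked tasks into dmesg (requires kernel.sysrq to permit 'w')
echo w > /proc/sysrq-trigger
dmesg -T | tail -n 100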
 
This looks to be a problem with Ceph, as it is stuck on removing the snapshot according to the log.
Code:
2020-11-04 01:19:54.512 7fca2cff9700 -1 librbd::object_map::InvalidateRequest: 0x7fca2800ace0 should_complete: r=0
Can you check the object map of your images and perhaps repair them? (https://docs.ceph.com/en/nautilus/man/8/rbd/)
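A rough sketch of how that could look, using the pool and image name from your storage and container config above (adjust to whichever image is affected):

Code:
# list leftover snapshots on the image (e.g. a stale 'vzdump' snapshot)
rbd snap ls ceph/vm-148-disk-0
# verify the object map of the image
rbd object-map check ceph/vm-148-disk-0
# rebuild it if the check flags it as invalid
rbd object-map rebuild ceph/vm-148-disk-0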
 
