cannot start container after backup failure

silvered.dragon

Hi to all,
I'm in a production environment on a 3-node Ceph cluster, so I know I can migrate my VMs to the other nodes, but at this particular moment I prefer not to migrate anything because I don't want to restart the affected node. I have a container with ID 118 that went into a locked state after a backup failure, so I ran the following commands:

Code:
pct unlock 118
rbd snap rm ceph/vm-118-disk-1@vzdump

then it became unresponsive, so I tried to kill the related [lxc-monitor] process
Code:
kill -9 {related lxc-monitor-118 pid}

now the container is shut down, but I cannot start it again because I'm getting this error

Code:
TASK ERROR: command 'systemctl start pve-container@118' failed: exit code 1

if I run `ps ax | grep lxc` there is a tar process in state D that obviously I cannot kill

Code:
758371 ?        D      0:55 tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs --xattrs-include=user.* --xattrs-include=security.capability --warning=no-file-ignored --warning=no-xattr-write --one-file-system --warning=no-file-ignored --directory=/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tmp ./etc/vzdump/pct.conf --directory=/mnt/vzsnap0 --no-anchored --exclude=lost+found --anchored --exclude=./tmp/?* --exclude=./var/tmp/?* --exclude=./var/run/?*.pid ./
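
For what it's worth, where such a D-state process is blocked in the kernel can be checked like this (a sketch, using the PID from the listing above):

Code:
cat /proc/758371/stack                   # kernel stack of the blocked task (needs root)
ps -o pid,stat,wchan:30,cmd -p 758371    # wait channel: the kernel function the task sleeps in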

any idea how I can start this container again without rebooting the node?
many thanks
 
hi,

there's no real way of killing a process stuck in uninterruptible sleep (state D): the kernel won't act on the signal until the I/O it's blocked on completes. you probably have to reboot your node.

but if you really want to try, you can unmount the filesystem you were backing up to, since D-state processes are usually caused by stuck I/O

edit:

force unmount, `umount -f`
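
for example, assuming the backup target is the NFS storage mounted at /mnt/pve/Anekup (the path from your task log), a force or lazy unmount attempt would look like:

Code:
umount -f /mnt/pve/Anekup    # force unmount of the NFS backup mount
umount -l /mnt/pve/Anekup    # lazy unmount: detach the mountpoint even while processes still hold it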
 
I had already tried this; in the end I restarted the node. Thanks anyway, but the problem still remains: every time I try to back up or snapshot this container the tar process gets stuck.
 
every time I try to back up or snapshot this container the tar process gets stuck.

a couple of places to check (example commands below):
* container config file (`pct config CTID`)
* /etc/pve/storage.cfg
* syslog during backup
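
for example (118 is your CTID, and the grep assumes the default syslog location):

Code:
pct config 118                   # container configuration
cat /etc/pve/storage.cfg         # storage definitions
grep -i vzdump /var/log/syslog   # backup-related entries from the current syslog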

how does it happen exactly? do you get any error messages, or does it just hang out of nowhere?

is it just this ct, or is it the storage?
 
Thank you @oguz for your reply,
I have a massive backup of around 10 TB of VMs from the Proxmox cluster to a large external FreeNAS device over a 10 Gb network; the whole backup takes around 4 hours during the night. This particular container has the highest ID, so it is the last one backed up. It is an Ubuntu 14.04 LTS container that we use for a vtiger CRM instance. If I use the stop backup mode everything goes fine, but with snapshot mode I have the issue that in the morning the cluster log is stuck on

Code:
 INFO: creating archive '/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_22-03_06_29.tar.lzo'

and some 10 hours later I have to manually interrupt the process by pressing the stop button. The container is still working, but I'm left with that tar process in D state, and if I try to restart the container it becomes unresponsive, so I have to reboot the entire node.

ct config file
Code:
arch: amd64
cores: 4
hostname: crm.xxxxxxxxx.com
memory: 4096
nameserver: 192.168.25.62
net0: name=eth0,bridge=vmbr0,gw=192.168.25.62,hwaddr=32:FD:04:01:AD:DC,ip=192.168.25.126/24,type=veth
onboot: 1
ostype: ubuntu
parent: vzdump
rootfs: ceph_ct:vm-118-disk-1,size=80G
searchdomain: xxx
swap: 512

storage config file
Code:
dir: local
        path /var/lib/vz
        content backup,iso,vztmpl
        maxfiles 5
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir
        nodes nodo2,nodo1,nodo3

rbd: ceph_vm
        content images
        krbd 0
        nodes nodo3,nodo2,nodo1
        pool ceph

rbd: ceph_ct
        content rootdir
        krbd 1
        nodes nodo2,nodo1,nodo3
        pool ceph

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        nodes utility
        sparse 1

zfspool: backup-pool
        pool backup_pool
        content rootdir,images
        nodes utility
        sparse 1

cifs: ts_syncro
        path /mnt/pve/ts_syncro
        server 192.168.25.100
        share TS_SYNCRO
        content iso
        nodes nodo1,nodo2,nodo3
        username administrator

nfs: Anekup
        export /mnt/ANEKUP_POOL/Proxmox_Backup
        path /mnt/pve/Anekup
        server 192.168.25.202
        content backup
        maxfiles 10
        options vers=3

The failed backup log
Code:
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd5
INFO: creating archive '/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tar.lzo'
INFO: remove vzdump snapshot
rbd: sysfs write failed
can't unmap rbd device /dev/rbd/ceph/vm-118-disk-1@vzdump: rbd: sysfs write failed
ERROR: Backup of VM 118 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tmp' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | lzop >/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tar.dat' failed: interrupted by signal
INFO: Failed at 2019-07-25 09:33:34
INFO: Backup job finished with errors

TASK ERROR: job errors
 
rbd: sysfs write failed
can't unmap rbd device /dev/rbd/ceph/vm-118-disk-1@vzdump: rbd: sysfs write failed

looks to be an rbd issue in the end.

i suspect you're hit by bug 1911[0], but to verify i need more info. basically, you can upgrade to 6.0 or wait for the upcoming kernel update.
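
in the meantime, if a stale @vzdump mapping is left behind after a failed backup, you may be able to clean it up by hand (a sketch; the device and snapshot names are taken from your log and may differ):

Code:
rbd showmapped                           # list kernel rbd mappings, look for vm-118-disk-1@vzdump
rbd unmap /dev/rbd5                      # try to unmap the stale snapshot device from the log
rbd snap rm ceph/vm-118-disk-1@vzdump    # then remove the leftover vzdump snapshot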

can you send me your `dmesg` output and the syslog during the backup? (remove sensitive information)


[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=1911
 
Dear @oguz, thanks again for your reply.
I always run the latest Proxmox version, but moving from 5 to 6 is a little difficult at the moment because I'm in a production environment in the middle of our busiest season. I will upgrade to 6 in the next two weeks and check whether it fixes the issue; for the moment I can safely run the backup in stop mode, since this is the only container giving me errors (and it is not a critical service).
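
A one-off stop-mode backup of just this container can be triggered manually, for example (a sketch; the storage name is taken from the storage.cfg above):

Code:
vzdump 118 --mode stop --compress lzo --storage Anekup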

For the dmesg and syslog, can you please tell me the right way to do this? As I said, the backup takes around 4 hours to complete; how can I grep only the related logs? many thanks
 
`dmesg -H` will give you human-readable output with date and time, you can just copy the relevant parts from there.

syslog will be rotated daily/weekly depending on your setup.

you can use

Code:
grep '^Jul 25 14:3' /var/log/syslog

to limit it to July 25 14:3x for example (30, 31, ..., 39)

or just `Jul 25` for a specific day
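
alternatively, since the node runs systemd, `journalctl` can filter by time directly, for example:

Code:
journalctl --since "2019-07-25 03:00" --until "2019-07-25 09:35"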
 
Hi, and sorry for taking up your time: do you think that the just-released kernel

proxmox-ve: 5.4-2 (running kernel: 4.15.18-19-pve)

fixes the issue, or do I have to update to Proxmox 6? I have a hard time understanding the changelog, and it's not clear to me whether a fixed kernel is also under development for Proxmox 5.4-2.
many thanks
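
For reference, the currently running kernel and the installed pve-kernel packages can be checked with standard commands (a newly installed kernel is only used after a reboot):

Code:
pveversion -v | grep -E 'proxmox-ve|pve-kernel'   # installed kernel packages and the running kernel
uname -r                                          # kernel currently in use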
 
