cannot start container after backup failure

silvered.dragon

Hi to all,
I'm in a production environment on a 3-node Ceph cluster, so I know I can migrate my VMs to the other nodes, but at this particular moment I prefer not to migrate anything because I don't want to restart the affected node. I have a container with ID 118 that went into a locked state after a backup failure, so I ran the following commands:

Code:
pct unlock 118
rbd snap rm ceph/vm-118-disk-1@vzdump

then it became unresponsive, so I tried to kill the related [lxc-monitor] process
Code:
kill -9 {related lxc-monitor-118 pid}

now the container is shut down, but I cannot start it again because I'm getting this error

Code:
TASK ERROR: command 'systemctl start pve-container@118' failed: exit code 1

if I run `ps ax | grep lxc` there is a tar process in state D that obviously I cannot kill

Code:
758371 ?        D      0:55 tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs --xattrs-include=user.* --xattrs-include=security.capability --warning=no-file-ignored --warning=no-xattr-write --one-file-system --warning=no-file-ignored --directory=/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tmp ./etc/vzdump/pct.conf --directory=/mnt/vzsnap0 --no-anchored --exclude=lost+found --anchored --exclude=./tmp/?* --exclude=./var/tmp/?* --exclude=./var/run/?*.pid ./
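
For what it's worth, where such a D-state process is blocked in the kernel can be checked like this (a sketch, using the PID from the listing above):

Code:
cat /proc/758371/stack                   # kernel stack of the blocked task (needs root)
ps -o pid,stat,wchan:30,cmd -p 758371    # wait channel: the kernel function the task sleeps in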

any idea how I can start this container again without rebooting the node?
many thanks
 
hi,

there's no real way of killing a process stuck in uninterruptible sleep (state D): the kernel won't act on the signal until the I/O it's blocked on completes. you probably have to reboot your node.

but if you really want to try, you can unmount the filesystem you were backing up to, since D-state processes are usually caused by stuck I/O

edit:

force unmount, `umount -f`
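
for example, assuming the backup target is the NFS storage mounted at /mnt/pve/Anekup (the path from your task log), a force or lazy unmount attempt would look like:

Code:
umount -f /mnt/pve/Anekup    # force unmount of the NFS backup mount
umount -l /mnt/pve/Anekup    # lazy unmount: detach the mountpoint even while processes still hold it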
 
I had already tried this; in the end I restarted the node. Thanks anyway, but the problem still remains: every time I try to back up or snapshot this container the tar process gets stuck.
 
every time I try to back up or snapshot this container the tar process gets stuck.

a couple of places to check (example commands below):
* container config file (`pct config CTID`)
* /etc/pve/storage.cfg
* syslog during backup
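
for example (118 is your CTID, and the grep assumes the default syslog location):

Code:
pct config 118                   # container configuration
cat /etc/pve/storage.cfg         # storage definitions
grep -i vzdump /var/log/syslog   # backup-related entries from the current syslog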

how does it happen exactly? do you get any error messages, or does it just hang out of nowhere?

is it just this ct, or is it the storage?
 
Thank you @oguz for your reply,
I have a massive backup of around 10 TB of VMs from the Proxmox cluster to a large external FreeNAS device over a 10 Gb network; the whole backup takes around 4 hours during the night. This particular container has the highest ID, so it is the last one backed up. It is an Ubuntu 14.04 LTS container that we use for a vtiger CRM instance. If I use the stop backup mode everything goes fine, but with snapshot mode I have the issue that in the morning the cluster log is stuck on

Code:
 INFO: creating archive '/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_22-03_06_29.tar.lzo'

and some 10 hours later I have to manually interrupt the process by pressing the stop button. The container is still working, but I'm left with that tar process in D state, and if I try to restart the container it becomes unresponsive, so I have to reboot the entire node.

ct config file
Code:
arch: amd64
cores: 4
hostname: crm.xxxxxxxxx.com
memory: 4096
nameserver: 192.168.25.62
net0: name=eth0,bridge=vmbr0,gw=192.168.25.62,hwaddr=32:FD:04:01:AD:DC,ip=192.168.25.126/24,type=veth
onboot: 1
ostype: ubuntu
parent: vzdump
rootfs: ceph_ct:vm-118-disk-1,size=80G
searchdomain: xxx
swap: 512

storage config file
Code:
dir: local
        path /var/lib/vz
        content backup,iso,vztmpl
        maxfiles 5
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir
        nodes nodo2,nodo1,nodo3

rbd: ceph_vm
        content images
        krbd 0
        nodes nodo3,nodo2,nodo1
        pool ceph

rbd: ceph_ct
        content rootdir
        krbd 1
        nodes nodo2,nodo1,nodo3
        pool ceph

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        nodes utility
        sparse 1

zfspool: backup-pool
        pool backup_pool
        content rootdir,images
        nodes utility
        sparse 1

cifs: ts_syncro
        path /mnt/pve/ts_syncro
        server 192.168.25.100
        share TS_SYNCRO
        content iso
        nodes nodo1,nodo2,nodo3
        username administrator

nfs: Anekup
        export /mnt/ANEKUP_POOL/Proxmox_Backup
        path /mnt/pve/Anekup
        server 192.168.25.202
        content backup
        maxfiles 10
        options vers=3

The failed backup log
Code:
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd5
INFO: creating archive '/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tar.lzo'
INFO: remove vzdump snapshot
rbd: sysfs write failed
can't unmap rbd device /dev/rbd/ceph/vm-118-disk-1@vzdump: rbd: sysfs write failed
ERROR: Backup of VM 118 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tmp' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | lzop >/mnt/pve/Anekup/dump/vzdump-lxc-118-2019_07_25-03_00_10.tar.dat' failed: interrupted by signal
INFO: Failed at 2019-07-25 09:33:34
INFO: Backup job finished with errors

TASK ERROR: job errors
 
rbd: sysfs write failed
can't unmap rbd device /dev/rbd/ceph/vm-118-disk-1@vzdump: rbd: sysfs write failed

looks to be an rbd issue in the end.

i suspect you're hit by bug 1911[0], but to verify i need more info. basically, you can upgrade to 6.0 or wait for the upcoming kernel update.
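
in the meantime, if a stale @vzdump mapping is left behind after a failed backup, you may be able to clean it up by hand (a sketch; the device and snapshot names are taken from your log and may differ):

Code:
rbd showmapped                           # list kernel rbd mappings, look for vm-118-disk-1@vzdump
rbd unmap /dev/rbd5                      # try to unmap the stale snapshot device from the log
rbd snap rm ceph/vm-118-disk-1@vzdump    # then remove the leftover vzdump snapshot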

can you send me your `dmesg` output and the syslog during the backup? (remove sensitive information)


[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=1911
 
Dear @oguz, thanks again for your reply.
I always run the latest Proxmox version, but moving from 5 to 6 is a little difficult at the moment because I'm in a production environment in the middle of our busiest season. I will upgrade to 6 in the next two weeks and check whether it fixes the issue; for the moment I can safely run the backup in stop mode, since this is the only container giving me errors (and it is not a critical service).
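
A one-off stop-mode backup of just this container can be triggered manually, for example (a sketch; the storage name is taken from the storage.cfg above):

Code:
vzdump 118 --mode stop --compress lzo --storage Anekup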

For the dmesg and syslog, can you please tell me the right way to do this? As I said, the backup takes around 4 hours to complete; how can I grep only the related logs? many thanks
 
`dmesg -H` will give you human-readable output with date and time, you can just copy the relevant parts from there.

syslog will be rotated daily/weekly depending on your setup.

you can use

Code:
grep '^Jul 25 14:3' /var/log/syslog

to limit it to July 25 14:3x for example (30, 31, ..., 39)

or just `Jul 25` for a specific day
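
alternatively, since the node runs systemd, `journalctl` can filter by time directly, for example:

Code:
journalctl --since "2019-07-25 03:00" --until "2019-07-25 09:35"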
 
Hi, and sorry for taking up your time: do you think that the just-released kernel

proxmox-ve: 5.4-2 (running kernel: 4.15.18-19-pve)

fixes the issue, or do I have to update to Proxmox 6? I have a hard time understanding the changelog, and it's not clear to me whether a fixed kernel is also under development for Proxmox 5.4-2.
many thanks
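
For reference, the currently running kernel and the installed pve-kernel packages can be checked with standard commands (a newly installed kernel is only used after a reboot):

Code:
pveversion -v | grep -E 'proxmox-ve|pve-kernel'   # installed kernel packages and the running kernel
uname -r                                          # kernel currently in use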
 
