OpenVZ container hangs when running Bacula backups over a bind-mounted CIFS share

RMM
Oct 25, 2013
We are using Proxmox 3.1:
proxmox-ve-2.6.32: 3.1-114 (running kernel: 2.6.32-26-pve)
pve-manager: 3.1-24 (running version: 3.1-24/060bd5a6)
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-2
pve-cluster: 3.0-8
qemu-server: 3.1-8
pve-firmware: 1.0-23
libpve-common-perl: 3.0-9
libpve-access-control: 3.0-8
libpve-storage-perl: 3.0-18
pve-libspice-server1: 0.12.4-2
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.1-1

We use an OpenVZ container as a Bacula backup server. The backups are written to a NAS device, which is mounted on the host using CIFS; via "mount --bind" we make it available inside the container.
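For reference, a minimal sketch of that setup on the host (the share path, mount points, and credentials file are hypothetical; CT 100 is the container from the logs below):

# Mount the NAS share on the Proxmox host via CIFS
mount -t cifs //nas/backup /mnt/nas-backup -o credentials=/root/.smbcredentials

# Bind-mount the share into the container's filesystem tree (CT 100)
mount --bind /mnt/nas-backup /var/lib/vz/root/100/mnt/backup

In practice the bind mount would usually go into a per-container mount script such as /etc/vz/conf/100.mount, so it is applied at container start.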
Small test backups work without any problems, but as soon as we run a real backup, the container always hangs after a few minutes: it is no longer possible to connect over SSH (it asks for the password, but hangs after the password is entered), and no other service responds either.
It also isn't possible to shut the container down over the Proxmox web interface. Only stopping it works, but that also gives a short error message:
(Once I had to restart the whole server to kill the container, but I couldn't reproduce this behaviour, so maybe that was another, independent problem.)

Removing stale lock file /var/lib/vz/lock/100.lck
Killing container ...
Container was stopped
Can't umount /var/lib/vz/root/100: Device or resource busy
TASK ERROR: command 'vzctl stop 100 --fast' failed: exit code 51
Manually unmounting it afterwards works.
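If the unmount ever fails with "Device or resource busy" again (as it did for vzctl above), the standard tools can show what is still holding the mount point from the error message:

# List processes that still have files open under the container root
fuser -vm /var/lib/vz/root/100
# Alternative view with lsof
lsof +D /var/lib/vz/root/100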
Dmesg also shows the following messages:
INFO: task sshd:438586 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sshd D ffff880beda4e200 0 438586 6761 100 0x00000000
ffff8810766c3e68 0000000000000082 ffff8810766c3f18 ffffffff811b179e
ffff8810766c3e28 00000000ffffff9c ffff8810766c3e70 ffff880394922000
ffff8810766c3e08 000000012a50c760 ffff8810766c3fd8 ffff8810766c3fd8
Call Trace:
[<ffffffff811b179e>] ? do_filp_open+0x14e/0xca0
[<ffffffff811ad443>] ? putname+0x33/0x50
[<ffffffff815427c2>] __mutex_lock_slowpath+0x122/0x170
[<ffffffff815421da>] mutex_lock+0x2a/0x50
[<ffffffff8119f616>] generic_file_llseek+0x36/0x70
[<ffffffff8119f6c7>] sys_lseek+0x77/0xb0
[<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
[the identical sshd hung-task trace repeats once more]
INFO: task sshd:443157 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sshd D ffff8803957f3460 0 443157 6761 100 0x00000004
ffff880394ff9e68 0000000000000082 0000000000000000 ffffffff811b179e
ffff880394ff9e28 00000000ffffff9c ffff880394ff9e70 ffff880397a46000
ffff880394ff9e08 000000012a52a35e ffff880394ff9fd8 ffff880394ff9fd8
Call Trace:
[<ffffffff811b179e>] ? do_filp_open+0x14e/0xca0
[<ffffffff815427c2>] __mutex_lock_slowpath+0x122/0x170
[<ffffffff815421da>] mutex_lock+0x2a/0x50
[<ffffffff8119f616>] generic_file_llseek+0x36/0x70
[<ffffffff8119f6c7>] sys_lseek+0x77/0xb0
[<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
CT: 100: stopped
CT: 100: started
INFO: task cron:478585 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cron D ffff88087a1d8880 0 478585 446520 100 0x00000000
ffff880377bebe68 0000000000000082 0000000000000000 ffffffff811b179e
ffff880377bebe28 00000000ffffff9c ffff880377bebe70 ffff8810781b9000
ffff880377bebe08 000000012ad6c419 ffff880377bebfd8 ffff880377bebfd8
Call Trace:
[<ffffffff811b179e>] ? do_filp_open+0x14e/0xca0
[<ffffffff815427c2>] __mutex_lock_slowpath+0x122/0x170
[<ffffffff815421da>] mutex_lock+0x2a/0x50
[<ffffffff8119f616>] generic_file_llseek+0x36/0x70
[<ffffffff8119f6c7>] sys_lseek+0x77/0xb0
[<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
[the identical cron hung-task trace repeats six more times]
CT: 100: stopped

Backups from the host system via vzdump to the NAS device, using the same CIFS mount, work without any problems.
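For comparison, the host-side backup amounts to something like this (the dump directory is the hypothetical CIFS mount point from above):

# vzdump on the host, writing to the same CIFS mount - this works fine
vzdump 100 --dumpdir /mnt/nas-backup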
Does anybody know a solution? If not, I'm going to install the Debian system in a KVM guest instead.
 
Hi,
I do similar things without trouble, but the bind mounts are "local" filesystems on the SAN. As backup device I use an LTO drive.

I assume the problem is the high IO that Bacula generates - see this screenshot of the writes on the spool disk:
[screenshot: io_spool.png]
CIFS is also not the best idea for high IO... do you have a chance to switch to NFS?
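A minimal sketch of the NFS variant, assuming the NAS exports the same data over NFS (server name and export path are hypothetical):

# Mount the NAS export via NFS instead of CIFS
mount -t nfs nas:/export/backup /mnt/nas-backup
# The bind mount into the container stays the same
mount --bind /mnt/nas-backup /var/lib/vz/root/100/mnt/backup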

Udo
 
Thanks a lot for your answer.
I tested NFS first, but for whatever reason got double the throughput with CIFS. Still, first thing tomorrow I'll try it with NFS. If it really depends on the filesystem used, I can still use the NAS device as an iSCSI target.
(I have let the backup machine run for 36 hours now, but it never responded again, and it also didn't write more than a few MB to the network share...)
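If NFS were to fail as well, the iSCSI fallback would look roughly like this with open-iscsi (portal address and target IQN are hypothetical; the LUN then carries an ordinary local filesystem):

# Discover and log in to the NAS iSCSI target
iscsiadm -m discovery -t sendtargets -p 192.168.0.50
iscsiadm -m node -T iqn.2013-10.example.nas:backup -p 192.168.0.50 --login
# Put a local filesystem on the new LUN and mount it
mkfs.ext3 /dev/sdb1    # placeholder device name for the LUN
mount /dev/sdb1 /mnt/nas-backup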
 
Thanks, man, you saved me.
Over NFS it's working without any problems (even though it's quite slow).
Over CIFS it never managed to write more than 2 MB before it hung...
No idea whether that's a bug in the CIFS implementation or in OpenVZ.
 
