UNIX sockets dead after chkpnt/restoring an OpenVZ container

iti-asi

Member
Jul 14, 2009
There seems to be an error in the way UNIX sockets are handled across an OpenVZ checkpoint/restore. We detected it with postfix: the mail system fails after restoring a container, with the following messages in the mail.log file:
Code:
Dec 17 17:04:27 clustest postfix/pickup[7575]: warning: connect #8 to subsystem public/cleanup: Connection refused
Dec 17 17:04:37 clustest postfix/pickup[7575]: warning: connect #9 to subsystem public/cleanup: Connection refused
Dec 17 17:04:47 clustest postfix/pickup[7575]: warning: connect #10 to subsystem public/cleanup: Connection refused
Dec 17 17:04:57 clustest postfix/pickup[7575]: fatal: connect #11 to subsystem public/cleanup: Connection refused
Dec 17 17:04:58 clustest postfix/master[6436]: warning: process /usr/lib/postfix/pickup pid 7575 exit status 1
Dec 17 17:04:58 clustest postfix/master[6436]: warning: /usr/lib/postfix/pickup: bad command startup -- throttling
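If anyone wants to check the same symptom on their own container, the cleanup socket and the master daemon's open descriptors can be inspected from inside the container with standard tools (just a sketch, assuming Debian's default /var/spool/postfix layout):

Code:
# inside the container, after the restore
ls -l /var/spool/postfix/public/cleanup   # the UNIX socket pickup tries to connect to
lsof -U -a -c master                      # UNIX sockets still held by the postfix master
# restarting postfix should re-create its sockets, at least until the next checkpoint/restore
/etc/init.d/postfix restart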

We've listed the open files before the suspend and after the restore; the relevant changes between the two states are as follows:

Code:
--- /tmp/lsof.1 2009-12-17 15:19:12.000000000 +0100
+++ /tmp/lsof.2 2009-12-17 15:19:26.000000000 +0100
@@ -24,8 +24,8 @@
 pickup  3686 postfix    0u   CHR                1,3         2924977 /dev/null
 pickup  3686 postfix    1u   CHR                1,3         2924977 /dev/null
 pickup  3686 postfix    2u   CHR                1,3         2924977 /dev/null
-pickup  3686 postfix    3r  FIFO                0,5       152371550 pipe
-pickup  3686 postfix    4w  FIFO                0,5       152371550 pipe
-pickup  3686 postfix    5u  unix 0xffff8102e7936680       152371436 socket
+pickup  3686 postfix    3r  FIFO    0,5       152388187 pipe
+pickup  3686 postfix    4w  FIFO    0,5       152388187 pipe
+pickup  3686 postfix    5u  sock    0,4       152388121 can't identify protocol
 pickup  3686 postfix    6u  FIFO               0,28         2883627 /var/spool/postfix/public/pickup
 pickup  3686 postfix    7u  0000                0,6     0      6908 anon_inode

As you can see, what was previously a UNIX socket is now unidentified, and that makes the pickup program fail.

We believe this must be happening with all UNIX sockets; we've also noticed that after a suspend/restore the rsyslog daemon stops working as well (it does not log the postfix restart ... ;)

We are not sure if this has been happening for a long time or if it's due to the kernel changes introduced with PVE 1.4, as we've only recently started using checkpointing regularly.
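
For reference, the diff above was produced from two plain lsof snapshots, roughly like this (a sketch; /tmp/lsof.1 and /tmp/lsof.2 are just the file names visible in the diff header):

Code:
# inside the container, before vzctl chkpnt
lsof -p $(pidof pickup) > /tmp/lsof.1
# ... checkpoint and restore the container from the host ...
# inside the container again, after vzctl restore
lsof -p $(pidof pickup) > /tmp/lsof.2
diff -u /tmp/lsof.1 /tmp/lsof.2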
 
It would be great if you could find an easy way to reproduce the bug (maybe with rsyslog)? We can then report it to the openvz team.
 
Hi dietmar,

Sorry for not being too clear. In my experience, the flaw is 100% reproducible with a plain chkpnt/restore of a Debian lenny container.

Just to be absolutely sure, I've downloaded the 64-bit Debian lenny template from Proxmox's template repository (http://download.proxmox.com/appliances/system/debian-5.0-standard_5.0-1_amd64.tar.gz) and created a new container based on it.

I've started it, logged in via ssh (so that it can still be checkpointed), and done the following:

Code:
container:~# lsof -p `pidof rsyslogd` > lsof1

I've checkpointed and restored the machine from the host:

Code:
host:~# vzctl chkpnt 2920
Setting up checkpoint...
	suspend...
	dump...
	kill...
Container is unmounted
Checkpointing completed succesfully
host:~# vzctl restore 2920
Restoring container ...
Starting container ...
Container is mounted
	undump...
Setting CPU units: 1000
Setting CPUs: 1
Configure meminfo: 262144
Configure veth devices: veth2920.dummy0 veth2920.0 veth2920.1 
Adding interface veth2920.dummy0 to bridge vhbr0 on CT0 for CT2920
Adding interface veth2920.0 to bridge vmbr0 on CT0 for CT2920
Adding interface veth2920.1 to bridge vmbr1 on CT0 for CT2920
	resume...
Container start in progress...
Restore second-level quota
Restoring completed succesfully

After this, and again inside my ssh session to the container:

Code:
container:~# lsof -p `pidof rsyslogd` > lsof2

Diff the two lsof outputs and you'll see that the sockets are now unidentified and unusable by rsyslogd:
Code:
container:~# diff -ub lsof*
--- lsof1	2009-12-18 11:18:54.000000000 +0000
+++ lsof2	2009-12-18 11:19:16.000000000 +0000
@@ -18,7 +18,7 @@
 rsyslogd 316 root  mem    REG              254,2           2294024 /lib/libpthread-2.7.so (path dev=0,28)
 rsyslogd 316 root  mem    REG              254,2           2189531 /usr/lib/libz.so.1.2.3.3 (path dev=0,28)
 rsyslogd 316 root  mem    REG              254,2           2293779 /lib/ld-2.7.so (path dev=0,28)
-rsyslogd 316 root    0u  unix 0xffff8103fee0a9c0         155783827 /dev/log
+rsyslogd 316 root    0u  sock    0,4         155791221 can't identify protocol
 rsyslogd 316 root    1w   REG               0,28    640    2206929 /var/log/auth.log
 rsyslogd 316 root    2w   REG               0,28    789    2205633 /var/log/syslog
 rsyslogd 316 root    3w   REG               0,28      0    2206934 /var/log/daemon.log
@@ -35,5 +35,5 @@
 rsyslogd 316 root   14w   REG               0,28      0    2206942 /var/log/debug
 rsyslogd 316 root   15w   REG               0,28    444    2206943 /var/log/messages
 rsyslogd 316 root   16u  FIFO               0,28           2285903 /dev/xconsole
-rsyslogd 316 root   17u  unix 0xffff8103fee0a680         155783829 /var/spool/postfix/dev/log
+rsyslogd 316 root   17u  sock    0,4         155791223 can't identify protocol
 rsyslogd 316 root   18r   REG               0,30      0 4026532234 /proc/kmsg

As you can see, both UNIX sockets in use by rsyslogd have changed in the output, and it's trivial to confirm that nothing is getting written to the syslog files.
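
A quick way to confirm the breakage without comparing lsof output is to send a test message from inside the container (a minimal sketch; logger ships with the standard bsdutils package on lenny):

Code:
logger "test after restore"        # writes via the /dev/log UNIX socket
tail -n 3 /var/log/syslog          # the test line is missing while rsyslogd holds the dead socket
/etc/init.d/rsyslog restart        # re-opens /dev/log, after which logger messages show up again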
 
Have you confirmed this bug in your setup, or reported it upstream? Just so I know the bug number and can keep an eye on it.

Sorry, I had no time to do further debugging - could you please test with the 2.6.18 kernel we released today (pvetest repository)? If the bug is still there, can you report it to the openvz team directly?
 

I've installed the 2.6.18 kernel on two nodes and, as far as I can tell, that fixes the UNIX socket problem: UNIX sockets do work after a suspend/resume cycle and after a container live migration.
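
For anyone who wants to repeat the same check, something like this should exercise both cases (a sketch; it assumes the stock OpenVZ vzmigrate tool, the CTID 2920 used earlier in the thread, and "othernode" standing in for the second node's hostname):

Code:
CTID=2920
# suspend/resume cycle on the first node
vzctl chkpnt $CTID && vzctl restore $CTID
vzctl exec $CTID 'logger chkpnt-test; tail -n 1 /var/log/syslog'
# online migration to the second node (run the same logger check there afterwards)
vzmigrate --online othernode $CTID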

Are there any plans to review/update the OpenVZ kernel patches for the 2.6.2* or 2.6.3* kernel versions?
 
But why don't you simply use the 2.6.18 kernel?

Yes, I'm evaluating that option; the thing is that I'd like to be sure the kernels are supported... Which 2.6.18 source is used to build the pve kernel image? I mean, are all known security patches included, or should I build my own versions if I want all the patches?
 

The source is from:

http://download.openvz.org/kernel/branches/rhel5-2.6.18/

It should include all security patches (it is based on the RHEL 5.4 kernel).
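
And to double-check which kernel and packages a node is actually running (a quick sketch; pveversion is part of the standard Proxmox VE tools):

Code:
uname -r          # running kernel, should show a 2.6.18-*-pve version after the switch
pveversion -v     # lists the installed pve packages, including the kernel, with their versions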
 
