UNIX sockets dead after chkpnt/restoring an OpenVZ container

iti-asi

Member
Jul 14, 2009
There seems to be an error in the way UNIX sockets are handled across an OpenVZ checkpoint/restore. We detected it with postfix: the mail system fails after restoring a container, with the following messages in the mail.log file:
Code:
Dec 17 17:04:27 clustest postfix/pickup[7575]: warning: connect #8 to subsystem public/cleanup: Connection refused
Dec 17 17:04:37 clustest postfix/pickup[7575]: warning: connect #9 to subsystem public/cleanup: Connection refused
Dec 17 17:04:47 clustest postfix/pickup[7575]: warning: connect #10 to subsystem public/cleanup: Connection refused
Dec 17 17:04:57 clustest postfix/pickup[7575]: fatal: connect #11 to subsystem public/cleanup: Connection refused
Dec 17 17:04:58 clustest postfix/master[6436]: warning: process /usr/lib/postfix/pickup pid 7575 exit status 1
Dec 17 17:04:58 clustest postfix/master[6436]: warning: /usr/lib/postfix/pickup: bad command startup -- throttling
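If anyone wants to check the same symptom on their own container, the cleanup socket and the master daemon's open descriptors can be inspected from inside the container with standard tools (just a sketch, assuming Debian's default /var/spool/postfix layout):

Code:
# inside the container, after the restore
ls -l /var/spool/postfix/public/cleanup   # the UNIX socket pickup tries to connect to
lsof -U -a -c master                      # UNIX sockets still held by the postfix master
# restarting postfix should re-create its sockets, at least until the next checkpoint/restore
/etc/init.d/postfix restart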

We've listed the open files before the suspend and after the restore; the relevant changes between the two states are as follows:

Code:
--- /tmp/lsof.1 2009-12-17 15:19:12.000000000 +0100
+++ /tmp/lsof.2 2009-12-17 15:19:26.000000000 +0100
@@ -24,8 +24,8 @@
 pickup  3686 postfix    0u   CHR                1,3         2924977 /dev/null
 pickup  3686 postfix    1u   CHR                1,3         2924977 /dev/null
 pickup  3686 postfix    2u   CHR                1,3         2924977 /dev/null
-pickup  3686 postfix    3r  FIFO                0,5       152371550 pipe
-pickup  3686 postfix    4w  FIFO                0,5       152371550 pipe
-pickup  3686 postfix    5u  unix 0xffff8102e7936680       152371436 socket
+pickup  3686 postfix    3r  FIFO    0,5       152388187 pipe
+pickup  3686 postfix    4w  FIFO    0,5       152388187 pipe
+pickup  3686 postfix    5u  sock    0,4       152388121 can't identify protocol
 pickup  3686 postfix    6u  FIFO               0,28         2883627 /var/spool/postfix/public/pickup
 pickup  3686 postfix    7u  0000                0,6     0      6908 anon_inode

As you can see, what was previously a UNIX socket is now unidentified, and that makes the pickup program fail.

We believe this must be happening with all UNIX sockets; we've also noticed that after a suspend/restore the rsyslog daemon stops working as well (it does not log the postfix restart ... ;)

We are not sure if this has been happening for a long time or if it's due to the kernel changes introduced with PVE 1.4, as we've only recently started using checkpointing regularly.
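
For reference, the diff above was produced from two plain lsof snapshots, roughly like this (a sketch; /tmp/lsof.1 and /tmp/lsof.2 are just the file names visible in the diff header):

Code:
# inside the container, before vzctl chkpnt
lsof -p $(pidof pickup) > /tmp/lsof.1
# ... checkpoint and restore the container from the host ...
# inside the container again, after vzctl restore
lsof -p $(pidof pickup) > /tmp/lsof.2
diff -u /tmp/lsof.1 /tmp/lsof.2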
 
It would be great if you could find an easy way to reproduce the bug (maybe with rsyslog)? We can then report it to the openvz team.
 
Hi dietmar,

Sorry for not being too clear. In my experience, the flaw is 100% reproducible with a plain chkpnt/restore of a Debian lenny container.

Just to be absolutely sure, I've downloaded the 64-bit Debian lenny template from Proxmox's template repository (http://download.proxmox.com/appliances/system/debian-5.0-standard_5.0-1_amd64.tar.gz) and created a new container based on it.

I've started it, logged in via ssh (so that it can still be checkpointed), and done the following:

Code:
container:~# lsof -p `pidof rsyslogd` > lsof1

I've checkpointed and restored the machine from the host:

Code:
host:~# vzctl chkpnt 2920
Setting up checkpoint...
	suspend...
	dump...
	kill...
Container is unmounted
Checkpointing completed succesfully
host:~# vzctl restore 2920
Restoring container ...
Starting container ...
Container is mounted
	undump...
Setting CPU units: 1000
Setting CPUs: 1
Configure meminfo: 262144
Configure veth devices: veth2920.dummy0 veth2920.0 veth2920.1 
Adding interface veth2920.dummy0 to bridge vhbr0 on CT0 for CT2920
Adding interface veth2920.0 to bridge vmbr0 on CT0 for CT2920
Adding interface veth2920.1 to bridge vmbr1 on CT0 for CT2920
	resume...
Container start in progress...
Restore second-level quota
Restoring completed succesfully

After this, and again inside my ssh session to the container:

Code:
container:~# lsof -p `pidof rsyslogd` > lsof2

Diff the two lsof outputs and you'll see that the sockets are now unidentified and unusable by rsyslogd:
Code:
container:~# diff -ub lsof*
--- lsof1	2009-12-18 11:18:54.000000000 +0000
+++ lsof2	2009-12-18 11:19:16.000000000 +0000
@@ -18,7 +18,7 @@
 rsyslogd 316 root  mem    REG              254,2           2294024 /lib/libpthread-2.7.so (path dev=0,28)
 rsyslogd 316 root  mem    REG              254,2           2189531 /usr/lib/libz.so.1.2.3.3 (path dev=0,28)
 rsyslogd 316 root  mem    REG              254,2           2293779 /lib/ld-2.7.so (path dev=0,28)
-rsyslogd 316 root    0u  unix 0xffff8103fee0a9c0         155783827 /dev/log
+rsyslogd 316 root    0u  sock    0,4         155791221 can't identify protocol
 rsyslogd 316 root    1w   REG               0,28    640    2206929 /var/log/auth.log
 rsyslogd 316 root    2w   REG               0,28    789    2205633 /var/log/syslog
 rsyslogd 316 root    3w   REG               0,28      0    2206934 /var/log/daemon.log
@@ -35,5 +35,5 @@
 rsyslogd 316 root   14w   REG               0,28      0    2206942 /var/log/debug
 rsyslogd 316 root   15w   REG               0,28    444    2206943 /var/log/messages
 rsyslogd 316 root   16u  FIFO               0,28           2285903 /dev/xconsole
-rsyslogd 316 root   17u  unix 0xffff8103fee0a680         155783829 /var/spool/postfix/dev/log
+rsyslogd 316 root   17u  sock    0,4         155791223 can't identify protocol
 rsyslogd 316 root   18r   REG               0,30      0 4026532234 /proc/kmsg

As you can see, both UNIX sockets in use by rsyslogd have changed in the output, and it's trivial to confirm that nothing is getting written to the syslog files.
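
A quick way to confirm the breakage without comparing lsof output is to send a test message from inside the container (a minimal sketch; logger ships with the standard bsdutils package on lenny):

Code:
logger "test after restore"        # writes via the /dev/log UNIX socket
tail -n 3 /var/log/syslog          # the test line is missing while rsyslogd holds the dead socket
/etc/init.d/rsyslog restart        # re-opens /dev/log, after which logger messages show up again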
 
Have you confirmed this bug in your setup, or reported it upstream? Just so I know the bug number and can keep an eye on it.

Sorry, I had no time to do further debugging - could you please test with the 2.6.18 kernel we released today (pvetest repository)? If the bug is still there, can you report it to the openvz team directly?
 

I've installed the 2.6.18 kernel on two nodes and, as far as I can tell, that fixes the UNIX socket problem: UNIX sockets do work after a suspend/resume cycle and after a container live migration.
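
For anyone who wants to repeat the same check, something like this should exercise both cases (a sketch; it assumes the stock OpenVZ vzmigrate tool, the CTID 2920 used earlier in the thread, and "othernode" standing in for the second node's hostname):

Code:
CTID=2920
# suspend/resume cycle on the first node
vzctl chkpnt $CTID && vzctl restore $CTID
vzctl exec $CTID 'logger chkpnt-test; tail -n 1 /var/log/syslog'
# online migration to the second node (run the same logger check there afterwards)
vzmigrate --online othernode $CTID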

Are there any plans to review/update the OpenVZ kernel patches for the 2.6.2* or 2.6.3* kernel versions?
 
But why don't you simply use the 2.6.18 kernel?

Yes, I'm evaluating that option; the thing is that I'd like to be sure the kernels are supported... Which 2.6.18 source is used to build the pve kernel image? I mean, are all known security patches included, or should I build my own versions if I want all the patches?
 

The source is from:

http://download.openvz.org/kernel/branches/rhel5-2.6.18/

It should include all security patches (it is based on the RHEL 5.4 kernel).
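
And to double-check which kernel and packages a node is actually running (a quick sketch; pveversion is part of the standard Proxmox VE tools):

Code:
uname -r          # running kernel, should show a 2.6.18-*-pve version after the switch
pveversion -v     # lists the installed pve packages, including the kernel, with their versions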
 
