There seems to be an error in the way UNIX sockets are handled after an OpenVZ checkpoint / restore. We detected it with postfix: the mail system fails after restoring a container, with the following messages in the mail.log file:
Code:
Dec 17 17:04:27 clustest postfix/pickup[7575]: warning: connect #8 to subsystem public/cleanup: Connection refused
Dec 17 17:04:37 clustest postfix/pickup[7575]: warning: connect #9 to subsystem public/cleanup: Connection refused
Dec 17 17:04:47 clustest postfix/pickup[7575]: warning: connect #10 to subsystem public/cleanup: Connection refused
Dec 17 17:04:57 clustest postfix/pickup[7575]: fatal: connect #11 to subsystem public/cleanup: Connection refused
Dec 17 17:04:58 clustest postfix/master[6436]: warning: process /usr/lib/postfix/pickup pid 7575 exit status 1
Dec 17 17:04:58 clustest postfix/master[6436]: warning: /usr/lib/postfix/pickup: bad command startup -- throttling
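The "Connection refused" condition in the log can be checked from inside the container without waiting for pickup to retry. The following is a minimal sketch, not part of postfix: `probe_unix_socket` is a hypothetical helper, and it assumes the service listens on a UNIX stream socket under the queue directory (for example /var/spool/postfix/public/cleanup, which is an assumed path).

```python
import errno
import socket


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a UNIX stream socket.

    Returns "ok" on success, otherwise the symbolic errno name
    (for example "ECONNREFUSED" or "ENOENT").
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "ok"
    except OSError as exc:
        # Fall back to the message text if no errno is attached (e.g. timeout).
        return errno.errorcode.get(exc.errno, str(exc))
    finally:
        sock.close()


if __name__ == "__main__":
    # Assumed path; adjust to the queue directory of the container under test.
    print(probe_unix_socket("/var/spool/postfix/public/cleanup"))
```

Running it before the suspend and again after the restore would show whether the listening side of the socket survived the checkpoint.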
We listed the open files before the suspend and after the restore; the relevant changes between the two states are as follows:
Code:
--- /tmp/lsof.1 2009-12-17 15:19:12.000000000 +0100
+++ /tmp/lsof.2 2009-12-17 15:19:26.000000000 +0100
@@ -24,8 +24,8 @@
pickup 3686 postfix 0u CHR 1,3 2924977 /dev/null
pickup 3686 postfix 1u CHR 1,3 2924977 /dev/null
pickup 3686 postfix 2u CHR 1,3 2924977 /dev/null
-pickup 3686 postfix 3r FIFO 0,5 152371550 pipe
-pickup 3686 postfix 4w FIFO 0,5 152371550 pipe
-pickup 3686 postfix 5u unix 0xffff8102e7936680 152371436 socket
+pickup 3686 postfix 3r FIFO 0,5 152388187 pipe
+pickup 3686 postfix 4w FIFO 0,5 152388187 pipe
+pickup 3686 postfix 5u sock 0,4 152388121 can't identify protocol
pickup 3686 postfix 6u FIFO 0,28 2883627 /var/spool/postfix/public/pickup
pickup 3686 postfix 7u 0000 0,6 0 6908 anon_inode
As you can see, the descriptor that was a unix socket before the suspend is reported after the restore as a sock of unidentified protocol, and that makes the pickup program fail.
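The comparison above was done by hand with diff. A minimal sketch of how it could be automated over two saved lsof listings follows; the function names are hypothetical, and it assumes the default lsof column order (COMMAND, PID, USER, FD, TYPE, ...).

```python
def fd_types(lsof_text):
    """Map (command, pid, fd) to the TYPE column of each lsof output line."""
    types = {}
    for line in lsof_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[1].isdigit():
            continue  # skip the header line and anything malformed
        command, pid, _user, fd, ftype = fields[:5]
        types[(command, int(pid), fd)] = ftype
    return types


def changed_types(before_text, after_text):
    """Return descriptors present in both listings whose TYPE differs."""
    before = fd_types(before_text)
    after = fd_types(after_text)
    return {
        key: (before[key], after[key])
        for key in before
        if key in after and before[key] != after[key]
    }


if __name__ == "__main__":
    # Same file names as in the diff above.
    with open("/tmp/lsof.1") as f1, open("/tmp/lsof.2") as f2:
        for key, (old, new) in sorted(changed_types(f1.read(), f2.read()).items()):
            print(key, old, "->", new)
```

Any descriptor reported as changing from unix to sock would be a candidate for the same problem in other daemons.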
We believe this must be happening with all UNIX sockets: we have also noticed that after a suspend/restore the rsyslog daemon stops working as well (it does not log the postfix restart ...
We are not sure whether this has been happening for a long time or is due to the kernel changes introduced with PVE 1.4, as we have only recently started using checkpointing regularly.