[SOLVED] Gateway stopped delivering mail

dthompson

Well-Known Member
Nov 23, 2011
146
15
58
Canada
www.digitaltransitions.ca
Hi all,

This morning I had an issue with my PMG server where it went deaf and stopped delivering emails to end email addresses. I use it as a smart host. Its built off of the Proxmox LXC container and its running on a Proxmox Server in a cluster. The server in the syslog went from working to no longer working with no errors from what I can tell. clamd then goes into its thing and in the rest of the logs are similar than the cannot connect to 127.0.01 process.

I'm a little concerned here as there is no rhyme or reason for the mailgateway to stop working and seeing emails in the logs, so I'd like to get a handle on what is causing this issue to make sure it doesn't happen again. I had quite a few angry clients today when they learned a few hours later that their emails were not being sent.

I tried to restart the postfix process to no avail and actually had to restart the entire virtual server to get it to start sending mail again.
Any help and insight to this for anyone who has seen this issue would be greatly appreciated as I'm stumped as to what the issue here is. I'm wondering is this is an issue that I need to give the server more resources to handle the load, but the server never does above 50% of Memory CPU utilization.

Code:
Mar  6 08:20:35 swarmx1 postfix/error[572809]: 11CA240C34: to=<snip@ksk.ca>, relay=none, delay=0.01, delays=0.01/0/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
Mar  6 08:20:35 swarmx1 postfix/error[572809]: 11CA240C34: to=<snip@ksk.ca>, relay=none, delay=0.01, delays=0.01/0/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
Mar  6 08:20:35 swarmx1 postfix/error[572809]: 11CA240C34: to=<snip@ksk.ca>, relay=none, delay=0.01, delays=0.01/0/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
Mar  6 08:20:35 swarmx1 clamd[572615]: Loaded 8975986 signatures.
Mar  6 08:20:37 swarmx1 clamd[572615]: LOCAL: Removing stale socket file /var/run/clamav/clamd.ctl
Mar  6 08:20:37 swarmx1 clamd[572615]: LOCAL: Unix socket file /var/run/clamav/clamd.ctl
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: Global time limit set to 120000 milliseconds.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: Global size limit set to 100000000 bytes.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: File size limit set to 25000000 bytes.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: Recursion level limit set to 5.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: Files limit set to 1000.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxEmbeddedPE limit set to 10485760 bytes.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxHTMLNormalize limit set to 10485760 bytes.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxHTMLNoTags limit set to 2097152 bytes.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxScriptNormalize limit set to 5242880 bytes.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxZipTypeRcg limit set to 1048576 bytes.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxPartitions limit set to 50.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxIconsPE limit set to 100.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: MaxRecHWP3 limit set to 16.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: PCREMatchLimit limit set to 100000.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: PCRERecMatchLimit limit set to 2000.
Mar  6 08:20:37 swarmx1 clamd[572615]: Limits: PCREMaxFileSize limit set to 26214400.

In my mail log for around the same time I am seeing this:

Code:
08:20:22 swarmx1 postfix/smtpd[572586]: connect from connect.digidns.ca[192.168.11.10]
Mar  6 08:20:23 swarmx1 postfix/smtpd[572586]: Anonymous TLS connection established from connect.digidns.ca[192.168.11.10]: TLSv1.1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [198.37.147.137]:48808 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: WHITELISTED [198.37.147.137]:48808
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [167.89.14.0]:25497 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: WHITELISTED [167.89.14.0]:25497
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [168.245.31.108]:4688 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: WHITELISTED [168.245.31.108]:4688
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [167.89.14.0]:52666 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: WHITELISTED [167.89.14.0]:52666
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [198.37.147.137]:14122 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: WHITELISTED [198.37.147.137]:14122
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [168.245.31.108]:22803 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: WHITELISTED [168.245.31.108]:22803
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [52.204.13.182]:54603 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: HANGUP after 0 from [52.204.13.182]:54603 in tests before SMTP handshake
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: DISCONNECT [52.204.13.182]:54603
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: HANGUP after 0 from [66.163.189.146]:34387 in tests before SMTP handshake
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: DISCONNECT [66.163.189.146]:34387
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [208.75.123.177]:16638 to [192.168.11.218]:25
Mar  6 08:20:23 swarmx1 postfix/lmtp[572629]: 62DEC40BF5: to=<ssmithers@hhmedia.com>, relay=127.0.0.1[127.0.0.1]:10024, delay=198, delays=0.17/0.23/0.43/197, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: warning: getpeername: Transport endpoint is not connected -- dropping this connection
Mar  6 08:20:23 swarmx1 postfix/postscreen[572696]: CONNECT from [167.89.14.205]:18148 to [192.168.11.218]:25
 
check your syslog/journal for daemons that failed
also how much memory did you give to the container? maybe it simply ran out of memory
 
Thanks. I don't see anything pertaining to anything stopping in the logs. I don't believe that it ran out of memory, but I've up the RAM on each server so we'll see if it tailspins after that. I'm actively monitoring it now.
 
So I'm resurrecting this thread.
Last night, I created 2 new servers on KVM as opposed to LXC as my old one were based on. They are running the latest PMG v7 and I backed up and re-imported the configs. The IP addresses are the same, the hostnames, cluster is good.

The master node again goes deaf for no reason. It will run fine for X amount of hours and then it just starts giving me the following:

Oct 5 13:13:53 swarmx1 postfix/error[25801]: 1170961962: to=<snip@domain.ca>, relay=none, delay=0.01, delays=0/0/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
Oct 5 13:14:32 swarmx1 postfix/error[25803]: 73CBE61966: to=<snip@domain.ca>, relay=none, delay=0.01, delays=0.01/0/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
Oct 5 13:14:32 swarmx1 postfix/error[25892]: 844DB61968: to=<snip@domain.ca>, relay=none, delay=0.01, delays=0.01/0/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)
Oct 5 13:14:48 swarmx1 postfix/error[25889]: 72CE26196A: to=<snip@domain.ca>, relay=none, delay=0.01, delays=0.01/0/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to 127.0.0.1[127.0.0.1]:10023: Connection refused)

The syslog at those times doesn't show anything out of the ordinary. The only thing I had done previously when it was on LXC was to have the VM reboot nightly and all was good. I was hoping that moving it to the latest version of PMG and running on KVM would resolve the issue, but also it does not.

This issue also only effects the master node, and not the secondary node in the cluster. Everything seems to be OK.

Current setup:
pmg-api/7.0-8/36f2537c (running kernel: 5.11.22-5-pve)
Memory: 4GB
Hard Disk: 12GB (3.1 used)
CPU: 2 sockets, 2 cores (4)
qemu-guest-agent: Yes

Both master and secondary nodes are identically configured. The secondary server is the quarantine server and the master servers is the smart host for sending / receiving emails.


I'd love some help if anyone has any on how to get this issue resolved.
Thank you!
 
Maybe there is not enough RAM in the VM?. Anyway, checks if the filter daemon is running, and restart that if necessary.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!