Service stopping and locking the queue

spamcopbr

New Member
Jul 9, 2022
24
0
1
I'm having problem with the pmg-smtp-filter service crashing sporadically, today i investigated the logs after the last shutdown and gathered the following information:

Mar 01 11:42:11 mx1.spamcop.com.br pmg-smtp-filter[267082]: fast exit to reduce server load (free 5267456 bytes)
Mar 01 11:42:23 mx1.spamcop.com.br systemd[1]: pmg-smtp-filter.service: A process of this unit has been killed by the OOM killer.
Mar 01 11:42:17 mx1.spamcop.com.br postfix/smtpd[267361]: warning: lost connection with proxy 127.0.0.1:10024
Mar 01 11:42:23 mx1.spamcop.com.br systemd[1]: pmg-smtp-filter.service: Killing process 241029 (pmg-smtp-filter) with signal SIGKILL.
Mar 01 11:42:17 mx1.spamcop.com.br postfix/smtpd[267361]: proxy-reject: END-OF-MESSAGE: 451 4.3.0 Error: queue file write error;

After this line all received emails record

Error: queue file write error

What could be causing this problem?
 
can you post the journal from the time the daemon crashed (maybe a few minutes/hours before and after too) and the output of pmgversion -v
 
Here is information I collected the first time the problem occurred.

journalctl -u pmg-smtp-filter -u postfix@- --since '2023-03-01 11:41:42' --until '2023-03-01 11:42:20'

Mar 01 11:41:43 mx1.spamcop.com.br postfix/anvil[219036]: statistics: max connection rate 13/60s for (smtpd:138.255.220.170) at Mar 1 11:35:14
Mar 01 11:41:43 mx1.spamcop.com.br postfix/anvil[219036]: statistics: max connection count 8 for (smtpd:138.255.220.170) at Mar 1 11:35:14
Mar 01 11:41:43 mx1.spamcop.com.br postfix/anvil[219036]: statistics: max cache size 82 at Mar 1 11:41:39


Mar 01 11:42:11 mx1.spamcop.com.br pmg-smtp-filter[267082]: fast exit to reduce server load (free 5267456 bytes)
Mar 01 11:42:23 mx1.spamcop.com.br systemd[1]: pmg-smtp-filter.service: A process of this unit has been killed by the OOM killer.
Mar 01 11:42:17 mx1.spamcop.com.br postfix/smtpd[267361]: warning: lost connection with proxy 127.0.0.1:10024
Mar 01 11:42:23 mx1.spamcop.com.br systemd[1]: pmg-smtp-filter.service: Killing process 241029 (pmg-smtp-filter) with signal SIGKILL.
Mar 01 11:42:17 mx1.spamcop.com.br postfix/smtpd[267361]: proxy-reject: END-OF-MESSAGE: 451 4.3.0 Error: queue file write error;

pmgversion -v

proxmox-mailgateway: 7.2-1 (API: 7.2-4/532fc47f, running kernel: 5.15.85-1-pve)
pmg-api: 7.2-4
pmg-gui: 3.2-2
pve-kernel-helper: 7.3-6
pve-kernel-5.15: 7.3-2
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
clamav-daemon: 0.103.8+dfsg-0+deb11u1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libarchive-perl: 3.4.0-1
libjs-extjs: 7.0.0-1
libjs-framework7: 4.4.7-1
libproxmox-acme-perl: 1.4.4
libproxmox-acme-plugins: 1.4.4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-http-server-perl: 4.1-6
libxdgmime-perl: 1.0-1
lvm2: not correctly installed
pmg-docs: 7.2-1
pmg-i18n: 2.8-3
pmg-log-tracker: 2.3.2-1
postgresql-13: 13.9-0+deb11u1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-spamassassin: 3.4.6-5
proxmox-widget-toolkit: 3.5.5
pve-firmware: 3.6-3
pve-xtermjs: 4.16.0-1
 
please post the log from the timestamps i asked for:
can you post the journal from the time the daemon crashed (maybe a few minutes/hours before and after too) and the output of pmgversion -v
a single minute and truncated to postfix/etc. is not enough to debug this
 
Today

Mar 27 12:07:54 mx1 kernel: [1335377.053148] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/clamav-daemon.service,task=clamd,pid=841922,uid=107
 
And what logs do you need to debug correctly?

I can collect, it just crashed once again.
the whole log from a bit before and after it happens e.g. with 'journalctl --since <datetime> --until <datetime>' and replace <datetime> with the start and end time of when you want to get the log
 
the log does not seem to contain the original error and there does not seem to be anything wrong with the server in this time frame...
 
Mar 27 11:16:18 mx1.spamcop.com.br pmg-smtp-filter[1085807]: 2023/03/27-11:16:18 Server closing!
Mar 27 11:16:18 mx1.spamcop.com.br systemd[1]: pmg-smtp-filter.service: Succeeded.
Mar 27 11:16:18 mx1.spamcop.com.br systemd[1]: Stopped Proxmox SMTP Filter Daemon.
Mar 27 11:16:18 mx1.spamcop.com.br systemd[1]: pmg-smtp-filter.service: Consumed 5h 37min 45.294s CPU time.
Mar 27 11:16:18 mx1.spamcop.com.br systemd[1]: Starting Proxmox SMTP Filter Daemon...

Mar 27 11:16:19 mx1.spamcop.com.br pmg-smtp-filter[1218381]: Process Backgrounded
Mar 27 11:16:19 mx1.spamcop.com.br pmg-smtp-filter[1218381]: 2023/03/27-11:16:19 main (type Net::Server::PreFork) starting! pid(1218381)
Mar 27 11:16:19 mx1.spamcop.com.br pmg-smtp-filter[1218381]: Binding to TCP port 10023 on host 127.0.0.1 with IPv4
Mar 27 11:16:19 mx1.spamcop.com.br pmg-smtp-filter[1218381]: Binding to TCP port 10024 on host 127.0.0.1 with IPv4

Mar 27 11:16:34 mx1.spamcop.com.br postfix/smtpd[1218834]: proxy-reject: END-OF-MESSAGE: 451 4.3.0 Error: queue file write error; from=<jessica.ventura@representapr.com.br> to=<fabiano@rasametalicas.com.br> proto=ESMTP helo=<painel.lojasamigao.com.br>



When the service restarts it seems to not be able to go up on its own anymore.

From this time onwards all emails are rejected with the proxy-reject message: END-OF-MESSAGE: 451 4.3.0 Error: queue file write error;
 
ok, so the service was restarted 2 times in short succesion? (why?) in any case the 451 error is only a temporary one, so the sending mail server should retry a bit later
the service has finished starting later again and it's just the one mail that was rejected because of this AFAICS
 
I repeat, when this stop occurs it obviously cannot restart the service. The log is now the time the service stopped rejecting all connections.

I want to understand what caused the crash, evidently the complaint is for lack of resources.

I increased the machine's ram to 16GB and I'm watching to see if it happens again. Although I believe that 8GB is more than enough for the number of configured domains and the volume of emails...

https://eth1.com.br/proxmox/log_novo.txt
 
Once again there was a crash and total unavailability of the service.

This time I was away and I didn't notice the drop, the service even comes back but it takes hours which causes a great inconvenience.

Follow the log link: https://www.transfernow.net/dl/20230418m9vUjlKx

The service went down between 18:00 and 19:00 on the 17th
And returned between 02:00 and 03:00 on the 18th

I already doubled the RAM from 8GB to 16GB so the problem is not lack of resources, unless it's the processor. Other users report this same problem both in a PMG group on telegram and right here on the forum, and all so far without a solution.

Any help is welcome.

Seleção_727.png
 
ok thanks for the full log output, this makes the picture much clearer what is happening:

in your case, the pmg-smtp-filter uses much more memory than is expected
we expect ~120 MiB of RSS use per pmg-smtp-filter and start many of them (to be able to parallelize the incoming mails)
in your case they use nearly 500MiB of RSS memory and then our calculation for the memory use does not work anymore
since the host has 16GiB of memory we'll start 40 of them (the maximum) which would work fine when they use 120MiB (would be a bit less than 5GiB)
but since they use 500MiB you'd need at least 20GiB of memory to handle that

so the question is, why do they use that much memory?
it should only use that much memory if you have a (too) big rule system. how much rules/objects/etc do you have configured?
 
I don't usually use many rules generated in the PMG interface, just the default ones and I don't feed the whitelists and blacklists through it, just the default ones and a disclaimer rule.

I feed block lists in files for both spamassassin and postfix, blocking emails, domains, subdomains and some ip ranges.

This has always worked well, the number of domains is the same as always, and I even used PMG with only 4GB of ram, without ever crashing. I realized that the crashes started in version 7.2.4 and since then I've been looking for a solution.

The lists I use are not very large, they have between 40,000 and 200,000 lines, which I have been feeding since an antispam solution prior to PMG.


Could I limit the number of processes so that the ram is not fully used? Despite tracking usage and never seeing it use up all the ram.
 
i mean we did recently update to spamassassin 4, no idea if thats reponsible in your case

but sadly no, for now there is way to limit the number of processes, but please file a feature request here: https://bugzilla.proxmox.com i think giving some control over how many processes are started would be good for cases such as yours
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!