[SOLVED] pmg-smtp-filter: Many instances/children running, each at 50% CPU; 6-core machine exhausted after 30-90 minutes

linux

Hi there,

Weird one, forked from this other thread about a similar issue. @Stoiko Ivanov

After a while, pmg-smtp-filter has many instances running, each taking about half a core, so the machine becomes CPU-overloaded fairly quickly.

We updated the 7.x branch to the latest version (on the same final sub-major) and rebooted, then added more CPU and RAM and rebooted again, but after 30-90 minutes the same thing happened.

We checked for custom template files, removed one, resynced the configs and rebooted. The same thing came back. We then updated to 8.0 and Bookworm and thought all was good, but the same thing is back.

Now, with the raised core count, it averages 60% usage across 6 cores, which is too much for current usage (2 cores were fine for 2 years until about 9pm last night). When we logged in today to investigate a mail flow issue report, mail processing time was at 1,500 seconds. It is now down to almost 100 seconds, which is good.

We were trying to do smart things with sa-learn and so on, so that end users could forward spam that was not flagged or quarantined as spam to the system, for it to learn that as a spam sample. (Side note: is this doable and, if so, how?) I feel something from that may be at play.

Nothing is jumping out in the logs. Perhaps we should reboot again then get the timestamp when it begins to flare? Or should it be more obvious?
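One way to pin down that timestamp without watching the box by hand is to sample the process table at intervals. The following is only a rough sketch, not something that was run in this thread: it assumes the filter processes show `pmg-smtp-filter` somewhere in their command line, and the function names are made up for illustration. Note that the `pcpu` value reported by `ps` is CPU time divided by process lifetime, so it lags behind sudden spikes.

```python
import subprocess
from datetime import datetime


def summarize_filter_processes(ps_output, name="pmg-smtp-filter"):
    """Return (process_count, summed_pcpu) for rows whose command contains `name`.

    Expects text shaped like the output of `ps -eo pid,pcpu,args`.
    """
    count = 0
    total_cpu = 0.0
    for row in ps_output.splitlines():
        fields = row.split(None, 2)  # pid, %cpu, full command line
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # skips the header row and blank lines
        if name in fields[2]:
            count += 1
            total_cpu += float(fields[1])
    return count, total_cpu


def sample_once():
    """Take one timestamped sample from the live system (call it from cron or a loop)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    count, cpu = summarize_filter_processes(out)
    stamp = datetime.now().isoformat(timespec="seconds")
    return f"{stamp} procs={count} pcpu_sum={cpu:.1f}"
```

Appending the result of `sample_once()` to a file once a minute should make the start of a flare visible as a jump in `procs` and `pcpu_sum`.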

Images in the previous thread.

Thanks!
 
Can you please share the logs/journal from that node - redact only what you must ...
 
How much do you want? Just for a few hours last night, and for pmg-smtp-filter only?

If it's a lot of logging, it is tricky to redact parts - what would you like?

Code:
Oct 04 19:00:46 1st-gate freshclam[768]: Received signal: wake up
Oct 04 19:00:46 1st-gate freshclam[768]: ClamAV update process started at Wed Oct  4 19:00:46 2023
Oct 04 19:00:46 1st-gate freshclam[768]: Received signal: wake up
Oct 04 19:00:46 1st-gate freshclam[768]: ClamAV update process started at Wed Oct  4 19:00:46 2023
Oct 04 19:00:46 1st-gate freshclam[768]: WARNING: Your ClamAV installation is OUTDATED!
Oct 04 19:00:46 1st-gate freshclam[768]: WARNING: Local version: 0.103.8 Recommended version: 0.103.10
Oct 04 19:00:46 1st-gate freshclam[768]: DON'T PANIC! Read https://docs.clamav.net/manual/Installing.html
Oct 04 19:00:46 1st-gate freshclam[768]: daily.cld database is up-to-date (version: 27050, sigs: 2042162, f-level: 90, builder: raynman)
Oct 04 19:00:46 1st-gate freshclam[768]: main.cvd database is up-to-date (version: 62, sigs: 6647427, f-level: 90, builder: sigmgr)
Oct 04 19:00:46 1st-gate freshclam[768]: bytecode.cld database is up-to-date (version: 334, sigs: 91, f-level: 90, builder: anvilleg)
Oct 04 19:00:46 1st-gate freshclam[768]: Your ClamAV installation is OUTDATED!
Oct 04 19:00:46 1st-gate freshclam[768]: Local version: 0.103.8 Recommended version: 0.103.10
Oct 04 19:00:46 1st-gate freshclam[768]: DON'T PANIC! Read https://docs.clamav.net/manual/Installing.html
Oct 04 19:00:46 1st-gate freshclam[768]: daily.cld database is up-to-date (version: 27050, sigs: 2042162, f-level: 90, builder: raynman)
Oct 04 19:00:46 1st-gate freshclam[768]: main.cvd database is up-to-date (version: 62, sigs: 6647427, f-level: 90, builder: sigmgr)
Oct 04 19:00:46 1st-gate freshclam[768]: bytecode.cld database is up-to-date (version: 334, sigs: 91, f-level: 90, builder: anvilleg)
Oct 04 19:00:46 1st-gate freshclam[768]: --------------------------------------
Oct 04 19:00:46 1st-gate pmg-smtp-filter[3567349]: 2023/10/04-19:00:46 CONNECT TCP Peer: "[127.0.0.1]:40072" Local: "[127.0.0.1]:10023"
Oct 04 19:00:46 1st-gate pmg-smtp-filter[3511469]: Starting "1" children

Then, when a few emails come in via pmg-smtp-filter, children are started; after some housekeeping, it kills some of them.

If we filter journalctl to pmg-smtp-filter, the output is very email-related, with lots of senders/receivers.
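To see when children are being started without sharing any sender/receiver details, the `Starting "N" children` lines can be tallied per minute. This is a minimal sketch that assumes the journal lines look exactly like the excerpt above (short timestamp, hostname, `pmg-smtp-filter[PID]:` prefix); the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt in this thread, e.g.
# Oct 04 19:00:46 1st-gate pmg-smtp-filter[3511469]: Starting "1" children
START_RE = re.compile(
    r'^(?P<minute>\w{3}\s+\d{1,2} \d{2}:\d{2}):\d{2} \S+ '
    r'pmg-smtp-filter\[\d+\]: Starting "(?P<count>\d+)" children'
)


def children_started_per_minute(lines):
    """Sum the number of children started, keyed by 'Mon DD HH:MM'."""
    per_minute = Counter()
    for line in lines:
        match = START_RE.match(line)
        if match:
            per_minute[match.group("minute")] += int(match.group("count"))
    return per_minute
```

A minute with an unusually high total would be a reasonable place to start reading the complete journal.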
 
A few hours ago the nodes' loads were around 50-70% each. Now they are around 15% each...

If it keeps flaring, we will check on it again. What scope for journalctl would help us give you useful info? :)
 
My guess is that there might be a particular mail that causes either ClamAV or SpamAssassin to run for far too long. This will show in the logs as some kind of timeout ... so if you still want to take a look, check the complete journal (definitely not only pmg-smtp-filter).
You can also share a few hours of the complete journal (redact only mail domains and public IPs) and maybe someone here finds something that stands out.
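The redaction suggested above could be sketched roughly as follows. This is an illustration only, under a few assumptions: mail domains are redacted only where they appear in an address of the form `local@domain`, only IPv4 addresses are handled, and the placeholder strings and function name are made up. Loopback and private addresses (such as the `127.0.0.1` in the excerpt above) are left intact because they are not public.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r'([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})')
IPV4_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def _redact_ip(match):
    text = match.group(0)
    try:
        address = ipaddress.ip_address(text)
    except ValueError:
        return text  # not a valid address (e.g. an octet above 255), leave it alone
    # Only globally routable addresses are replaced.
    return "REDACTED-IP" if address.is_global else text


def redact_line(line):
    """Replace mail domains and public IPv4 addresses in one journal line."""
    line = EMAIL_RE.sub(r'\1@REDACTED-DOMAIN', line)
    return IPV4_RE.sub(_redact_ip, line)
```

The input could be a few hours of the complete journal, for example from `journalctl --since "2023-10-04 18:00" --until "2023-10-04 22:00"`. Bare hostnames and IPv6 addresses are not touched by this sketch, so the result still needs a manual read-through before posting.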
 
Thank you for the insights, that makes a lot of sense. I've been monitoring the systems since the resource increases, and they're OK.

Loads are back around more normal levels. It does seem to match your hunch, though: sometimes the volume does not really change, yet the system is crunching harder to get through the same amount. I suppose those are the cases where we should now be OK, since the limits have gone up.
 
