[SOLVED] pmg-smtp-filter: Many instances/children running, each at 50% CPU; 6-core machine exhausted after 30-90 minutes

linux · Oct 5, 2023

Hi there,

Weird one, forked from this other thread about a similar issue. @Stoiko Ivanov

After a while, pmg-smtp-filter has many instances running. Each takes about half a core, so the machine becomes CPU-overloaded fairly quickly.

We updated the 7.x branch to the latest release (on the same final sub-major) and rebooted, then added more CPU and RAM and rebooted again, but after 30-90 minutes the same thing happened.

We checked for custom template files, removed one, resynced the configs, and rebooted. The same thing came back. We then updated to 8.0 and Bookworm and thought all was good, but the same thing is back.

Now, with the raised core count, it averages 60% usage across 6 cores, which is too much for the current mail volume (2 cores were fine for 2 years until about 9pm last night). When we logged in today to investigate a reported mail flow issue, mail processing time was at 1,500 seconds. It is now down to almost 100 seconds, which is good.

We had been trying to do smart things with sa-learn and so on, so that end users could forward spam that was not flagged or quarantined to the system for it to learn as a spam sample. (Side note: is this doable, and if so, how?) I suspect something from that may be at play.

Nothing is jumping out in the logs. Perhaps we should reboot again then get the timestamp when it begins to flare? Or should it be more obvious?

Images in the previous thread.

Thanks!

Stoiko Ivanov · Oct 5, 2023

Can you please share the logs/journal from that node - redact only what you must ...

linux · Oct 5, 2023

How much do you want? Just for a few hours last night, and for pmg-smtp-filter only?

If it's a lot of logging, it's tricky to redact parts. What would you like?

Code:

Oct 04 19:00:46 1st-gate freshclam[768]: Received signal: wake up
Oct 04 19:00:46 1st-gate freshclam[768]: ClamAV update process started at Wed Oct  4 19:00:46 2023
Oct 04 19:00:46 1st-gate freshclam[768]: Received signal: wake up
Oct 04 19:00:46 1st-gate freshclam[768]: ClamAV update process started at Wed Oct  4 19:00:46 2023
Oct 04 19:00:46 1st-gate freshclam[768]: WARNING: Your ClamAV installation is OUTDATED!
Oct 04 19:00:46 1st-gate freshclam[768]: WARNING: Local version: 0.103.8 Recommended version: 0.103.10
Oct 04 19:00:46 1st-gate freshclam[768]: DON'T PANIC! Read https://docs.clamav.net/manual/Installing.html
Oct 04 19:00:46 1st-gate freshclam[768]: daily.cld database is up-to-date (version: 27050, sigs: 2042162, f-level: 90, builder: raynman)
Oct 04 19:00:46 1st-gate freshclam[768]: main.cvd database is up-to-date (version: 62, sigs: 6647427, f-level: 90, builder: sigmgr)
Oct 04 19:00:46 1st-gate freshclam[768]: bytecode.cld database is up-to-date (version: 334, sigs: 91, f-level: 90, builder: anvilleg)
Oct 04 19:00:46 1st-gate freshclam[768]: Your ClamAV installation is OUTDATED!
Oct 04 19:00:46 1st-gate freshclam[768]: Local version: 0.103.8 Recommended version: 0.103.10
Oct 04 19:00:46 1st-gate freshclam[768]: DON'T PANIC! Read https://docs.clamav.net/manual/Installing.html
Oct 04 19:00:46 1st-gate freshclam[768]: daily.cld database is up-to-date (version: 27050, sigs: 2042162, f-level: 90, builder: raynman)
Oct 04 19:00:46 1st-gate freshclam[768]: main.cvd database is up-to-date (version: 62, sigs: 6647427, f-level: 90, builder: sigmgr)
Oct 04 19:00:46 1st-gate freshclam[768]: bytecode.cld database is up-to-date (version: 334, sigs: 91, f-level: 90, builder: anvilleg)
Oct 04 19:00:46 1st-gate freshclam[768]: --------------------------------------
Oct 04 19:00:46 1st-gate pmg-smtp-filter[3567349]: 2023/10/04-19:00:46 CONNECT TCP Peer: "[127.0.0.1]:40072" Local: "[127.0.0.1]:10023"
Oct 04 19:00:46 1st-gate pmg-smtp-filter[3511469]: Starting "1" children

Then, when a few emails pass through pmg-smtp-filter, children are spawned; after some housekeeping, it kills some of them.

If we filter journalctl to pmg-smtp-filter, the output is very email-related, so it contains lots of senders and receivers.
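On the redaction worry: a minimal Python sketch of masking journal lines before sharing them. It keeps the local part of each address and replaces the domain, and it masks IPv4 addresses that Python's `ipaddress` module does not classify as private or loopback. The regular expressions, the `redacted.example` placeholder, and the function names are assumptions made for illustration; bare hostnames and IPv6 addresses are not handled, so the result still needs a manual check.

```python
import ipaddress
import re

# Loose illustrative patterns (assumptions), not a full address grammar.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _mask_ip(match):
    """Replace a public IPv4 address; leave everything else untouched."""
    text = match.group(0)
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return text  # looked like an address but is not one
    # Loopback and private ranges stay visible; they help with debugging.
    if addr.is_private or addr.is_loopback:
        return text
    return "x.x.x.x"


def redact_line(line):
    """Mask mail domains and public IPv4 addresses in one journal line."""
    line = EMAIL_RE.sub(
        lambda m: m.group(0).split("@")[0] + "@redacted.example", line
    )
    return IPV4_RE.sub(_mask_ip, line)
```

One way to use it is to pipe a time-bounded `journalctl --since ... --until ...` export through a small loop that calls `redact_line` on every line of `sys.stdin` and prints the result.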

linux · Oct 6, 2023

A few hours ago nodes' loads were around 50-70% each. Now they are around 15% each...

If it keeps flaring, we will check on it again. Knowing what scope of journalctl output you want would help us provide the right info.
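To pin down when a flare begins, so the journal can be cut to that window, one option is to sample the process table periodically. The sketch below parses the output of `ps -eo pcpu,args` and counts processes whose command line contains `pmg-smtp-filter`. That the workers show up under that name in `ps` is an assumption, as are the function names and the 60-second interval. Note that the `pcpu` column of `ps` is averaged over each process's lifetime, so it lags behind sudden spikes.

```python
import subprocess
import time
from datetime import datetime


def summarize(ps_output, needle="pmg-smtp-filter"):
    """Return (process_count, total_cpu_percent) for matching processes.

    Expects the text output of `ps -eo pcpu,args`: one header line, then
    one line per process with the CPU percentage in the first column.
    """
    count, total = 0, 0.0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) == 2 and needle in fields[1]:
            try:
                total += float(fields[0])
            except ValueError:
                continue  # unexpected number format, ignore this line
            count += 1
    return count, total


def sample_forever(interval=60):
    """Print one timestamped sample per interval until interrupted."""
    while True:
        out = subprocess.run(
            ["ps", "-eo", "pcpu,args"],
            capture_output=True, text=True, check=True,
        ).stdout
        count, cpu = summarize(out)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} procs={count} cpu={cpu:.1f}%", flush=True)
        time.sleep(interval)
```

Calling `sample_forever()` in a terminal or a tmux session and redirecting the output to a file gives a timeline that can be lined up with the journal timestamps.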

Stoiko Ivanov · Oct 6, 2023

My guess is that there might be a particular mail that causes either ClamAV or SpamAssassin to run for far too long. This will show in the logs as some kind of timeout ... So if you still want to take a look, check the complete journal (definitely not only pmg-smtp-filter).
You can also share a few hours of the complete journal (redact only mail domains and public IPs), and maybe someone here finds something that stands out.
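As a rough way to act on this, the complete journal can be scanned for lines that mention a timeout, grouped by hour and by the process that logged them. This is only a sketch: the thread does not quote the exact wording of such messages, so the keyword pattern is a guess, and the line format is taken from the journal excerpt earlier in the thread. The function and variable names are made up for illustration.

```python
import re
from collections import Counter

# Assumed keywords: "timeout", "timed out", "time-out" and similar.
KEYWORDS = re.compile(r"timed?[ -]?out", re.IGNORECASE)

# Matches lines like "Oct 04 19:00:46 host process[pid]: message".
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}):\d{2}:\d{2} \S+ ([^\[:\s]+)(?:\[\d+\])?: (.*)$"
)


def timeout_hits(lines):
    """Count timeout-like messages per (hour, process) pair."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and KEYWORDS.search(m.group(3)):
            hour, process = m.group(1), m.group(2)
            hits[(hour, process)] += 1
    return hits
```

Feeding it the lines of a full `journalctl` export and printing `hits.most_common()` shows which hour and which process stand out, which narrows down where to read the journal by hand.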

linux · Oct 20, 2023

Thank you for the insights, that makes a lot of sense. I've been monitoring the systems since the resource increases, and they're OK.

Loads are back around more normal levels. It does seem to fit your hunch, though: sometimes the volume does not really change, yet the system is crunching harder to process the same amount. I suppose those are the cases where we should now be OK, since the limits have gone up.
