Bayes Still not filtering emails after lots of spam received

migquebec · Jul 25, 2020

Hi, I am new to PMG and I searched the forums before posting. I cannot see any Bayes results in the headers of my received emails. Auto-learning is enabled in the GUI, but it does not seem to work.

My local.cf looks like this:
---------------------------------------------------------------------------
# dont use things by default
use_bayes 0
bayes_auto_expire 0
bayes_learn_to_journal 1

ok_languages all

envelope_sender_header X-Proxmox-Envelope-From

# use fast lock (non-nfs save)
lock_method flock

use_bayes 1

include /usr/share/spamassassin-extra/KAM.cf
---------------------------------------------------------------------------

I have not modified the local.cf file. I am confused as to why use_bayes appears in two places. Does it come from the template?
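One way to read the duplicated option: the first use_bayes line looks like a default and the second like the value written when Bayes is switched on in the GUI. The Python sketch below only illustrates that reading. It assumes that SpamAssassin applies the last occurrence of a scalar option, so treat the result as an illustration rather than as proof of what the filter really uses; the function name is made up for this example.

```python
# Sketch: resolve duplicate directives with a "last one wins" rule.
# ASSUMPTION: a later scalar setting in local.cf overrides an earlier one.
LOCAL_CF = """\
# dont use things by default
use_bayes 0
bayes_auto_expire 0
bayes_learn_to_journal 1

ok_languages all

envelope_sender_header X-Proxmox-Envelope-From

# use fast lock (non-nfs save)
lock_method flock

use_bayes 1

include /usr/share/spamassassin-extra/KAM.cf
"""


def effective_settings(text):
    """Return {directive: value}, keeping the last occurrence of each directive."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        name, _, value = line.partition(" ")
        settings[name] = value.strip()  # a later line overwrites an earlier one
    return settings


settings = effective_settings(LOCAL_CF)
print("use_bayes resolves to:", settings["use_bayes"])
```

Under that assumption the config above ends up with Bayes enabled, so the duplicate line by itself would not explain the missing Bayes results.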

Also, here are some example headers from an email I received that is clearly spam:

X-SPAM-LEVEL: Spam detection results: 1
AC_BR_BONANZA 0.001 Too many newlines in a row... spammy template
FROM_EXCESS_BASE64 0.001 From: base64 encoded unnecessarily
FROM_LOCAL_NOVOWEL 0.5 From: localpart has series of non-vowel letters
HTML_IMAGE_RATIO_08 0.001 HTML has a low ratio of text to image area
HTML_MESSAGE 0.001 HTML included in message
KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
MIME_HTML_ONLY 0.1 Message only has text/html MIME parts
MSGID_FROM_MTA_HEADER 0.001 Message-Id was added by a relay
RDNS_NONE 1.274 Delivered to internal network by a host with no rDNS
SPF_HELO_PASS -0.001 SPF: HELO matches SPF record
SPF_PASS -0.001 SPF: sender matches SPF record
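To make the report above easier to check, here is a small Python sketch that sums the listed rule scores and looks for a BAYES_* rule. The parsing assumes the three-column layout shown in this post (rule name, score, description); the helper name is invented for the example.

```python
# Sketch: sum the rule scores from the report and check for a BAYES_* hit.
# ASSUMPTION: each line is "<RULE_NAME> <score> <description>" as pasted above.
REPORT = """\
AC_BR_BONANZA 0.001 Too many newlines in a row... spammy template
FROM_EXCESS_BASE64 0.001 From: base64 encoded unnecessarily
FROM_LOCAL_NOVOWEL 0.5 From: localpart has series of non-vowel letters
HTML_IMAGE_RATIO_08 0.001 HTML has a low ratio of text to image area
HTML_MESSAGE 0.001 HTML included in message
KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
MIME_HTML_ONLY 0.1 Message only has text/html MIME parts
MSGID_FROM_MTA_HEADER 0.001 Message-Id was added by a relay
RDNS_NONE 1.274 Delivered to internal network by a host with no rDNS
SPF_HELO_PASS -0.001 SPF: HELO matches SPF record
SPF_PASS -0.001 SPF: sender matches SPF record
"""


def parse_report(text):
    """Return a list of (rule_name, score) tuples."""
    rules = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # name, score, rest of the description
        if len(parts) < 2:
            continue
        rules.append((parts[0], float(parts[1])))
    return rules


rules = parse_report(REPORT)
total = sum(score for _, score in rules)
has_bayes = any(name.startswith("BAYES_") for name, _ in rules)
print(f"rules={len(rules)} total={total:.3f} bayes_rule_present={has_bayes}")
```

The listed scores add up to about 1.9, consistent with the level of 1 in the header, and no BAYES_* rule appears among them.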

I use PMG in a cluster as a relay/filter between Internet and my internal mail server. I have whitelisted only the local IP range so my own server can send emails without being filtered.

The setup has been running since July 3rd, 2020. So far I have received 1015 emails with a score of 10 or more.

In the statistics for this month, a total of 281 000 emails came through PMG; I can see 56 000 junk mails and 10 466 spam mails.

I read that I could use sa-learn to train Bayes, but I have a hard time getting at the emails since they are on a Windows-based mail server (I cannot mount a partition or anything like that).

Any ideas?

Thanks a lot in advance.

migquebec · Jul 28, 2020

Sorry for the additional question. I am fairly new to SpamAssassin. Does anyone know if there is a log specific to Bayes or SpamAssassin that I could use to troubleshoot this? I can't even be sure that Bayes auto-learn is really active (apart from the mail log).

Stoiko Ivanov · Jul 28, 2020

Check out the SpamAssassin documentation on Bayes and autolearning - that should provide a good starting point:
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/BayesInSpamAssassin
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/AutolearningNotWorking

I hope this helps!

migquebec · Jul 28, 2020

Thanks Stoiko, I will look at it. But can you tell from my config whether it is enabled? In the Proxmox GUI, Bayes auto-learn is enabled, but it does not work... I am a bit confused.

Thanks again for the links.

heutger · Aug 5, 2020

@Stoiko Ivanov The links look nice, but the advice there needs to be adapted to the way PMG integrates SpamAssassin. I'm also wondering (as I recently thought about such an option myself) whether training on forwarded spam really works well, since the forwarded message may also contain header data from my own system, the one I forward the spam from. I recently experimented with extracting just the attachment and sending the plain spam as an attachment to another mailbox. Maybe PMG could support this kind of training, where spam that was not caught is forwarded to PMG (e.g. via outbound filtering or an extra administrative address on PMG).

However, the issue the user has here may be the same problem I saw: autolearning and Bayes only work once you reach a count of 200 spam and 200 ham. While 200 ham is easy to reach, 200 spam (really autolearned) is hard to reach (on a commercial test setup at an internet company that has been running for two and a half years and handles a lot of mail, I have never hit the limit). Until you reach the limit, Bayes is useless. On my private installation I trained manually, and Bayes is working well.

migquebec · Sep 14, 2020

I was able to train SA with what remains in the quarantine folders, but a lot of spam still gets through. I noticed that mail.log shows many lines like the one below. The SA score changes, but bayes stays at "undefined" and autolearn stays at "no":

SA score=5/5 time=5.637 bayes=undefined autolearn=no autolearn_force=no
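For going through many such lines, a short Python sketch can pull the key=value pairs out of each one so they can be counted or filtered. It assumes the fields always look like the line quoted above, and it assumes that bayes=undefined means no Bayes result was attached to that message; the function name is made up for the example.

```python
import re

# Sketch: split the "SA ..." part of a mail.log line into key=value fields.
# ASSUMPTION: fields are whitespace-separated "key=value" pairs as in the line above.
LINE = "SA score=5/5 time=5.637 bayes=undefined autolearn=no autolearn_force=no"


def parse_sa_fields(line):
    """Return a dict such as {'score': '5/5', 'bayes': 'undefined', ...}."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


fields = parse_sa_fields(LINE)
# ASSUMPTION: "undefined" means the message received no Bayes classification.
bayes_ran = fields.get("bayes") != "undefined"
print(fields)
print("Bayes result present:", bayes_ran)
```

Running this over a whole mail.log (one call per matching line) would show whether bayes is undefined for every message or only for some of them.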

I have more than 200 messages that were identified as SPAM and HAM:
sa-learn --dump magic
3 0 non-token data: bayes db version
3734 0 non-token data: nspam
30985 0 non-token data: nham
229754 0 non-token data: ntokens
1598864671 0 non-token data: oldest atime
1600113935 0 non-token data: newest atime
1600113938 0 non-token data: last journal sync atime
1600068373 0 non-token data: last expiry atime
172800 0 non-token data: last expire atime delta
7578 0 non-token data: last expire reduction count
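As a quick cross-check of the numbers above, the following Python sketch parses the dump and compares nspam and nham with 200, which is the SpamAssassin default for both bayes_min_spam_num and bayes_min_ham_num. It takes the value from the second-to-last column before the label, which fits the shortened output pasted here and, as far as I know, the full four-column output as well. Note the assumption that this sa-learn call opened the same database the PMG filter uses; the sketch cannot verify that.

```python
from datetime import datetime, timezone

# Sketch: read counters from "sa-learn --dump magic" output pasted as text.
DUMP = """\
3 0 non-token data: bayes db version
3734 0 non-token data: nspam
30985 0 non-token data: nham
229754 0 non-token data: ntokens
1598864671 0 non-token data: oldest atime
1600113935 0 non-token data: newest atime
1600113938 0 non-token data: last journal sync atime
1600068373 0 non-token data: last expiry atime
172800 0 non-token data: last expire atime delta
7578 0 non-token data: last expire reduction count
"""

MIN_SPAM = 200  # SpamAssassin default for bayes_min_spam_num
MIN_HAM = 200   # SpamAssassin default for bayes_min_ham_num


def parse_dump_magic(text):
    """Return {label: int_value} for every 'non-token data' line."""
    values = {}
    for line in text.splitlines():
        head, sep, label = line.partition("non-token data:")
        if not sep:
            continue
        # The value sits in the second-to-last column before the label.
        values[label.strip()] = int(head.split()[-2])
    return values


values = parse_dump_magic(DUMP)
enough = values["nspam"] >= MIN_SPAM and values["nham"] >= MIN_HAM
newest = datetime.fromtimestamp(values["newest atime"], tz=timezone.utc)
print("nspam:", values["nspam"], "nham:", values["nham"], "enough:", enough)
print("newest token access (UTC):", newest.isoformat())
```

With 3734 spam and 30985 ham, both counters are well above the default minimums, so the message count alone should not be what keeps Bayes from scoring.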

In the quarantine, some emails show a BAYES score and some do not. I don't quite understand why.

I'm still confused, and since I cannot read the emails on my mail server directly (it is hosted on Windows), it is a bit hard to have PMG train on mails that went through.

Any suggestions?

heutger · Sep 15, 2020

autolearn may still stay at "no" even though you trained Bayes, because the score a message must reach to be autolearned stays the same whether you trained the Bayes database manually or not. It certainly makes sense not to learn messages that are only possibly spam, but it does not help Bayes scoring, because then only "hard spam" is learned as spam and everything in between is missed. However, if you trained manually and reached the 200-message limit, messages should now also get BAYES_xxx rules, where xxx indicates the spam probability. Depending on that probability, additional points are added or subtracted.

However, Bayes is the last "fine-tuning" step and the final barrier. Before Bayes you should establish a good set of blacklists; additional checks like Pyzor, DCC and maybe GeoIP; additional rulesets like Heinlein and Schaal (if you are getting German messages, otherwise look for similar lists for your language region); more frequently updated KAM rules and additional non-KAM rules; maybe HashBL and similar; and, last but not least, Bayes optimization.
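The autolearn behaviour described above can be sketched roughly as follows. This is a simplified Python model, not the actual SpamAssassin code: it uses the stock defaults of 12.0 for bayes_auto_learn_threshold_spam and 0.1 for bayes_auto_learn_threshold_nonspam, plus the documented requirement of at least 3 points each from header and body rules for spam. It ignores that the real score used for autolearning is computed separately from the final score (for example without the Bayes rules), and it assumes PMG leaves these defaults unchanged. The function and constant names are made up for the example.

```python
# Simplified model of the autolearn decision (not the real implementation).
SPAM_THRESHOLD = 12.0   # default bayes_auto_learn_threshold_spam
HAM_THRESHOLD = 0.1     # default bayes_auto_learn_threshold_nonspam
MIN_PART_POINTS = 3.0   # spam also needs >= 3 header points and >= 3 body points


def autolearn_decision(score, header_points=0.0, body_points=0.0):
    """Return 'spam', 'ham' or 'no' for a message's autolearn score."""
    if score < HAM_THRESHOLD:
        return "ham"
    if (score >= SPAM_THRESHOLD
            and header_points >= MIN_PART_POINTS
            and body_points >= MIN_PART_POINTS):
        return "spam"
    return "no"  # between the thresholds: nothing is learned


# A message scoring 5, like the mail.log line quoted in this thread:
print(autolearn_decision(5.0))
# A high-scoring message with enough header and body points:
print(autolearn_decision(15.0, header_points=4.0, body_points=6.0))
```

In this model a message with a score of 5 falls between the two thresholds, which would match the autolearn=no seen in the log regardless of how much manual training was done.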
