Train Spam/Ham PMG

thiagotgc

Dec 17, 2019
I was thinking...

Would it be possible to create a SPAM account and a HAM account within PMG, send emails to those accounts, and have PMG learn from them?

The current learning seems weak to me, or almost nonexistent!

It would be extremely important to be able to teach the system, including about false positives.
 
This is currently not implemented and all our testing with this shows no big improvements. There are workarounds and reports from other users, just search the forum.
 
Is there a possibility that this will officially become part of PMG?

Code:
--------------------------------------------
The basic idea is to have SpamAssassin train automatically on the spam messages that users move into the "Junk Folder" of their e-mail client (e.g. Zimbra).

The flow:
- each user populates their "Junk Folder" directly through the e-mail client
- the e-mail client or the e-mail server forwards the messages to the respective spam / ham addresses on PMG
- the messages are stored in maildir format in /home/USER/Maildir/new
- every day, PMG runs /etc/cron.daily/caricaspamham, which calls sa-learn with the '--no-sync' parameter and imports the messages from /home/USER/Maildir/new
- every hour, PMG runs /etc/cron.hourly/proxmox, which takes care of synchronizing the journal file /root/.spamassassin/bayes_journal with the db


How to continue:
- change the content of the file "/var/lib/proxmox/templates/main.cf.in" so that postfix delivers to maildir instead of mailbox

# Changes made to handle the maildir format
home_mailbox = Maildir/
mailbox_command =

- apply changes

proxconfig -s

- reload postfix configuration

service postfix reload

- add a new user to manage spam and one to handle the ham

adduser sa-spam-7893ddfg44hyh --disabled-login --shell /bin/false
adduser sa-ham-34545ghf4r77jh --disabled-login --shell /bin/false

- create two folders to hold the messages already processed by sa-learn, so that they can be un-learned later with the --forget parameter if you notice "strange" behavior on the dansguardian side

mkdir /home/sa-spam-7893ddfg44hyh/analizzati
mkdir /home/sa-ham-34545ghf4r77jh/analizzati

- send a test email to each account to make sure the Maildir folders get created, then check their contents

echo "mail body text" | mail -s "Mail subject" sa-spam-7893ddfg44hyh@antispam.levico.locale
echo "mail body text" | mail -s "Mail subject" sa-ham-34545ghf4r77jh@antispam.levico.locale

How to check whether the filter is being applied correctly?

sa-learn --dump magic

Contents of the file "/opt/localbin/sa-wrapper.pl"

#!/usr/bin/perl -w
# Time-stamp: <05 April 2004, 13:37 home>
#
# sa-wrapper.pl
#
# SpamAssassin sa-learn wrapper
# (c) Alexandre Jousset, 2004
# This script is GPL'd
#
# Thanks to: Chung-Kie Tung for the removal of the dir
#            Adam Gent for bug report
#
# v1.2

use strict;
use MIME::Tools;
use MIME::Parser;    # module names are case-sensitive: "MIME::parser" does not load

my $DEBUG = 0;
my $UNPACK_DIR = '/tmp';
my $SA_LEARN = '/usr/bin/sa-learn';

# learning mode passed on the command line: "spam" or "ham"
my ($spamham) = @ARGV;

# walk the MIME tree and pipe every attached message (message/rfc822) to sa-learn
sub recurs
{
    my $ent = shift;

    if ($ent->head->mime_type eq 'message/rfc822') {
        if ($DEBUG) {
            unlink "/tmp/spam.log.$$" if -e "/tmp/spam.log.$$";
            open(OUT, "|$SA_LEARN -D --$spamham --no-sync >>/tmp/spam.log.$$ 2>&1") or die "Cannot pipe $SA_LEARN: $!";
        } else {
            open(OUT, "|$SA_LEARN --$spamham --no-sync") or die "Cannot pipe $SA_LEARN: $!";
        }

        $ent->bodyhandle->print(\*OUT);

        close(OUT);
        return;
    }

    my @parts = $ent->parts;

    if (@parts) {
        map { recurs($_) } @parts;
    }
}

if ($DEBUG) {
    MIME::Tools->debugging(1);
    open(STDERR, ">/tmp/spam_err.log");
}
my $parser = MIME::Parser->new;
$parser->extract_nested_messages(0);
$parser->output_under($UNPACK_DIR);

my $entity;
eval {
    $entity = $parser->parse(\*STDIN);
};

if ($@) {
    die $@;
} else {
    recurs($entity);
}

$parser->filer->purge;
rmdir $parser->output_dir;


Contents of the file "/etc/cron.daily/caricaspamham"

#!/bin/sh

FILESPAM=/home/sa-spam-7893ddfg44hyh
FILEHAM=/home/sa-ham-34545ghf4r77jh
WRAPPERFILE=/opt/localbin/sa-wrapper.pl

# set to 1 to remove the files from Maildir/new after sa-learn has analyzed the e-mail messages
# (they are moved to the "analizzati" folder)
SPOSTA_FILE=1

if ls "${FILESPAM}"/Maildir/new/* >/dev/null 2>&1; then
    for f in "${FILESPAM}"/Maildir/new/*
    do
        echo "learning spam via ${f}..."
        "${WRAPPERFILE}" spam < "${f}"
        if [ "$SPOSTA_FILE" -eq 1 ]; then
            mv "${f}" "${FILESPAM}/analizzati"
        fi
    done
fi

if ls "${FILEHAM}"/Maildir/new/* >/dev/null 2>&1; then
    for f in "${FILEHAM}"/Maildir/new/*
    do
        echo "learning ham via ${f}..."
        "${WRAPPERFILE}" ham < "${f}"
        if [ "$SPOSTA_FILE" -eq 1 ]; then
            mv "${f}" "${FILEHAM}/analizzati"
        fi
    done
fi

exit 0
--------------------------------------------
 
As we do not see big improvements in spam detection with this method, we do not see reasons to implement it.

But we are open to accepting patches if you want to implement something. Join our pmg-devel mailing list.
 
Add to /etc/pmg/templates/local.cf.in

use_bayes_rules 1
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam -0.001
bayes_auto_learn_threshold_spam 4.0

Do you believe it will improve learning?

------
Yes, I would like to be on that list!
 
Since it is not possible to teach SpamAssassin through PMG, would it be possible for PMG to learn a message as SPAM when you "Delete" it in the Quarantine, and as NOT SPAM when you "Send" it from the Quarantine?
 
Hello, I would like to add something here...
You did a great job! We are evaluating PMG for enterprise use and we are looking forward to implementing PMG in our infrastructure!
I must say, false positives are very few, but false negatives are very high.
From my side, Bayes learning or an rspamd implementation are important features for international business. Please implement one of these. We have different business areas/countries, whose spammers just "slip" through the default filters (even in a non-production environment/domain).
Thanks in advance
 
Since it is not possible to teach SpamAssassin through PMG, would it be possible for PMG to learn a message as SPAM when you "Delete" it in the Quarantine, and as NOT SPAM when you "Send" it from the Quarantine?
That could be an approach, but it would only work for users who use the spam quarantine on PMG, so there would be no benefit for those who only tag spam.
 
We use SA autolearn mode, so only classified mails are used for learning.

I currently run two installations: one with manual Bayes learning and one with autolearn. Manual learning works fine, and the Bayes scores help especially when mails land somewhere between the spam score thresholds. Autolearn does not work: now, more than two and a half years later, still not enough spam has been autolearned, which is why Bayes is not running. There is enough ham, but not enough spam. So it's more or less useless for "one company doing internet business (with a reasonable volume of mail)" installations. An extra button for learning in the quarantine would be a good start: if someone uses the quarantine (I don't), spam is already reviewed manually, and this process can be used to improve filtering. Even better would be a way, like in rspamd, to upload files to train spam and ham directly on the system.
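
To check whether an installation is in this "Bayes not running" state, the counters printed by `sa-learn --dump magic` can be compared against SpamAssassin's minimums. The following is only a minimal sketch: it assumes the stock defaults of 200 learned spam and 200 learned ham (`bayes_min_spam_num` / `bayes_min_ham_num`) and that `sa-learn` is run as the user owning the Bayes database. The function names `bayes_count` and `bayes_status` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: decide whether the Bayes DB holds enough learned messages.
# Assumption: default minimums of 200 spam and 200 ham are in effect.
MIN_MSGS=200

# Read `sa-learn --dump magic` output on stdin and print the counter for
# the given key (nspam or nham). The counter is the third column and the
# key is the last word of the line.
bayes_count() {
    awk -v key="$1" '$NF == key { print $3; exit }'
}

# Print a one-line verdict for the given nspam and nham counters.
bayes_status() {
    if [ "$1" -ge "$MIN_MSGS" ] && [ "$2" -ge "$MIN_MSGS" ]; then
        echo "bayes active (nspam=$1 nham=$2)"
    else
        echo "bayes inactive (nspam=$1 nham=$2, need $MIN_MSGS each)"
    fi
}

# Typical use on the mail gateway (not executed here):
#   dump=$(sa-learn --dump magic)
#   bayes_status "$(printf '%s\n' "$dump" | bayes_count nspam)" \
#                "$(printf '%s\n' "$dump" | bayes_count nham)"
```

If the verdict is "inactive" because of a low nspam counter, the BAYES_* rules will not contribute to the score no matter how much ham has been learned.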

Just to show the values: my private installation runs with a reject score of 5 and has just a handful of false positives. My commercial test installation runs with a reject score of 10 and has a reasonable number of false positives. So I cannot lower the value, because the Bayes scoring is missing.

In terms of spam detection quality, I would rank things in this order:
1. a good set of blacklists
2. Pyzor and DCC in addition to Razor
3. bayes
4. additional rules, faster KAM updates, ...
5. additional settings like GeoIP, HashBL, ...

So Bayes isn't first in line overall, but for content filtering it is.
 
Classified emails? How do I classify a received email as spam when Proxmox did not identify it?

You can't. There are SpamAssassin internal rules (you can find them via Google) which say that a mail is auto-learned as spam only when it reaches a minimum score from header rules and a minimum score from body rules. That's exactly why some users are asking for one of these:

1. Learn via the quarantine, i.e. an explicit click on release learns ham and on delete learns spam (I wouldn't need this BTW, as I don't use the quarantine)
2. Provide learning forms like rspamd does to learn spam or ham
3. Provide ways to feed spam or ham to PMG, such as internal mailboxes to forward spam or ham to, or IMAP accounts that integrate with your mail solution so spam or ham can be moved there
4. Integrate PMG with a mail server system provided by Proxmox, which also allows learning spam or ham directly

Ordered from easiest to hardest to implement (and maybe by the probability of getting such a solution ;-))
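
The auto-learn rule mentioned at the top of this post can be sketched roughly as follows. This is a simplified illustration, not SpamAssassin's actual code: it assumes the documented defaults (`bayes_auto_learn_threshold_spam` 12.0, `bayes_auto_learn_threshold_nonspam` 0.1, and at least 3 points each from header and body rules to learn as spam) and ignores details such as the score being computed without the Bayes rules themselves. The function name `autolearn_verdict` is hypothetical.

```shell
#!/bin/sh
# Simplified sketch of the auto-learn decision (assumed default values).
SPAM_THRESHOLD=12.0    # bayes_auto_learn_threshold_spam
HAM_THRESHOLD=0.1      # bayes_auto_learn_threshold_nonspam
MIN_PART_POINTS=3      # points needed from header rules AND from body rules

# usage: autolearn_verdict SCORE HEADER_POINTS BODY_POINTS
# Prints "ham", "spam" or "not learned".
autolearn_verdict() {
    awk -v s="$1" -v h="$2" -v b="$3" \
        -v max="$SPAM_THRESHOLD" -v min="$HAM_THRESHOLD" \
        -v part="$MIN_PART_POINTS" '
    BEGIN {
        if (s < min)
            print "ham"
        else if (s >= max && h >= part && b >= part)
            print "spam"
        else
            print "not learned"
    }'
}

# Examples:
#   autolearn_verdict 15 8 7     -> spam
#   autolearn_verdict 15 14 1    -> not learned (too few body points)
#   autolearn_verdict 6.5 3 3.5  -> not learned (below the spam threshold)
```

Because of the 3 + 3 point requirement, the SpamAssassin documentation gives 6 as the minimum working value for `bayes_auto_learn_threshold_spam`, so a setting such as the 4.0 posted earlier in this thread cannot take full effect. Low-scoring spam never reaches these limits, which is why it is never learned automatically.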
 
Provide learning forms like rspamd does to learn spam or ham
Provide ways to feed spam or ham to PMG, such as internal mailboxes to forward spam or ham to
This makes plenty of sense. SpamExperts (back when they had decent owners) had an Outlook plug-in, and a mailbox you could forward to (which understood forwards, thus only interpreted the data that mattered). PMG is effective though it does feel like it's tricky to get up & off the ground.

Integrate PMG with a mail server system provided by Proxmox, which also allows learning spam or ham directly
I don't think this makes as much sense. There are many mail servers out there, and a Gateway/Filter makes more sense (& is easier to implement) than an Email solution overall. This way they can focus on Filtering & Protection as a whole, without the woes that come with a full email system.

We already use bayes with autolearning mode.
Sadly, while this is great for high-scoring spam, low-scoring spam that avoids many of the usual factors is becoming more prevalent. They engineer the spam to trip fewer rules, and Bayes by default isn't effective at making determinations. If you quarantine at score 3, Bayes opting for -2 makes it ham.
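
The arithmetic behind that last sentence, as a small sketch. It assumes the final score is simply the sum of the other rule hits and the Bayes rule score, and that a message is quarantined when the total is at or above the configured level. The function name `final_verdict` and the sample numbers are purely illustrative.

```shell
#!/bin/sh
# Sketch: how a negative Bayes score can pull a borderline mail under
# the quarantine level. All numbers are illustrative.

# usage: final_verdict RULE_SCORE BAYES_SCORE QUARANTINE_LEVEL
# Prints the total score and the resulting action.
final_verdict() {
    awk -v r="$1" -v b="$2" -v q="$3" 'BEGIN {
        total = r + b
        printf "%.1f %s\n", total, (total >= q ? "quarantine" : "deliver")
    }'
}

# Examples:
#   final_verdict 4.0 0 3    -> 4.0 quarantine
#   final_verdict 4.0 -2 3   -> 2.0 deliver
```

A Bayes database trained only on easy, high-scoring spam tends to rate carefully crafted messages as ham-like, which is exactly the second case.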

Are there methods Proxmox are looking at to crowd-source data from PMG deployments? This would assist with identifying and more accurately scoring the more cautiously crafted Spam. You could anonymise the Spam, the primary goal would be to have a more accurate system at the lower end of the scoring spectrum. At the moment it feels like there's a void and it takes a bit of work to get your system up to scratch in that regard.

Separately - when you do a config export from PMG, does that include all the efforts you've put into sa-learn?
 
