Saturday, March 27, 2010

Train Snow Leopard Spamassassin using shared IMAP folders

By default, Apple only supports training the SpamAssassin Bayesian spam filter with spam and ham through two special user accounts, "junkmail" and "notjunkmail".
This might work for you.

But I, for example, use a shared IMAP account as a public folder where people in the company share common emails.
The other day it occurred to me to add two folders to this shared account (one for spam, one for ham) and train SpamAssassin from them.
It works.
All you need is:
  • Folders in a shared IMAP account for spam and ham emails (e.g. "SPAM_not_detected" and "SPAM_false_positive")
  • A script which learns from these emails and deletes them afterwards.
  • A launcher script and a plist file for it to start regularly.
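
The training script below reads each folder straight from the mail store, where it expects a hidden directory named after the folder (for example ".SPAM_not_detected") with "cur" and "new" subdirectories inside. Normally you would simply create the two folders from an IMAP client. The following sketch only reproduces that layout in a scratch directory, so you can see what the script looks for; the helper name make_maildir_folder and the scratch path are made up for this illustration.

```shell
#!/bin/sh
# Sketch only: recreate the on-disk layout the training script expects.
# make_maildir_folder is a hypothetical helper, not part of Dovecot.
make_maildir_folder() {
    root=$1   # the user's mail directory
    name=$2   # IMAP folder name, e.g. SPAM_not_detected
    mkdir -p "$root/.$name/cur" "$root/.$name/new" "$root/.$name/tmp"
}

# Scratch location (assumption), so nothing touches the real mail store
SCRATCH=${TMPDIR:-/tmp}/sa-train-demo.$$
make_maildir_folder "$SCRATCH" "SPAM_not_detected"
make_maildir_folder "$SCRATCH" "SPAM_false_positive"
ls -a "$SCRATCH"
```

Remove the scratch directory again when you are done looking at it.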
Here they are.
Save the following script as /usr/local/bin/train-spamassassin and make it executable (chmod 755):

#!/bin/sh
# Script to Train Spamassassin SPAM/HAM emails in certain folders
# USE AT YOUR OWN RISK!

################################################################################
# SETUP
# Mail user that holds the spam/ham folders
MAIL_USER=exampleuser
# Folder for undetected spam (relative to the user's mail directory)
SPAM_MAILDIR="SPAM_not_detected"
# Folder for false positives (relative to the user's mail directory)
FALSE_POSITIVE_MAILDIR="SPAM_false_positive"
# Email address to send status email to. You need to enable mail sending below.
EMAILADDR=you@anything.xxxx
################################################################################
SA_LEARN_PATH=/usr/bin/sa-learn
DB_PATH=/var/amavis/.spamassassin
SA_LEARN_CMD="$SA_LEARN_PATH --dbpath $DB_PATH --no-sync"

# Uncomment the next line and the last line of the script to send the status report by email
#(
echo "################################################################"
echo "Spam Assassin Training Script $0"
echo "################################################################"
# Determine the user's GUID
USER_GUID=`/usr/bin/cvt_mail_data -i "$MAIL_USER"`
if [ `expr "$USER_GUID" : '[0-9A-F]*-[0-9A-F]*-[0-9A-F]*-[0-9A-F]*-[0-9A-F]*'` -eq 0 ]
then
    echo "Error: Can't find GUID of mail user $MAIL_USER"
    echo "$USER_GUID"
    exit 1
fi
# Determine the mail store path from the "default:" entry of the partition map
PART_MAP_PATH=/etc/dovecot/partition_map.conf
MAIL_STORE_PATH=`grep "^default:" "$PART_MAP_PATH" | sed 's,^default:,,'`
case "$MAIL_STORE_PATH" in
    /*) # OK, absolute path
        ;;
    *)  echo "Error: Can't determine mail store path from $PART_MAP_PATH"
        exit 1
        ;;
esac
USER_MAIL_PATH="$MAIL_STORE_PATH/$USER_GUID"
SPAM_MAIL_PATH=$USER_MAIL_PATH/.$SPAM_MAILDIR
FALSE_POSITIVE_MAIL_PATH=$USER_MAIL_PATH/.$FALSE_POSITIVE_MAILDIR

if [ ! -d "$SPAM_MAIL_PATH" ]; then
    echo "Error: Spam mail dir $SPAM_MAIL_PATH does not exist"
    exit 1
fi
if [ ! -d "$FALSE_POSITIVE_MAIL_PATH" ]; then
    echo "Error: False positives mail dir $FALSE_POSITIVE_MAIL_PATH does not exist"
    exit 1
fi

echo " "
echo "################################################################"
echo "Learning SPAM Entries in $SPAM_MAIL_PATH"
echo "################################################################"

$SA_LEARN_CMD --spam "$SPAM_MAIL_PATH/cur/"
$SA_LEARN_CMD --spam "$SPAM_MAIL_PATH/new/"
# wc -l pads its output with blanks on Mac OS X, so strip them
count=`ls "$SPAM_MAIL_PATH/cur/" | wc -l | tr -d ' '`
echo "There are $count mails in the folder"
if [ "$count" -ne 0 ]; then
    echo "Deleting spam mails.."
    find "$SPAM_MAIL_PATH/cur/" -type f -print0 | xargs -0 rm
else
    echo "No mails to delete."
fi

################################################################################
# FALSE Positives
################################################################################
echo " "
echo "################################################################"
echo "Learning False Positive Entries in $FALSE_POSITIVE_MAIL_PATH"
echo "################################################################"
$SA_LEARN_CMD --ham "$FALSE_POSITIVE_MAIL_PATH/cur/"
# new/ (not tmp/) holds delivered but unread mails, same as for the spam folder
$SA_LEARN_CMD --ham "$FALSE_POSITIVE_MAIL_PATH/new/"

# wc -l pads its output with blanks on Mac OS X, so strip them
count=`ls "$FALSE_POSITIVE_MAIL_PATH/cur/" | wc -l | tr -d ' '`
echo "There are $count mails in the folder"
if [ "$count" -ne 0 ]; then
    echo "Deleting false positives.."
    find "$FALSE_POSITIVE_MAIL_PATH/cur/" -type f -print0 | xargs -0 rm
else
    echo "No mails to delete."
fi

echo "syncing database.."
sudo $SA_LEARN_PATH --dbpath $DB_PATH --sync

echo "end of script."
#) | /usr/bin/mail -s "$HOSTNAME Spamassassin Training status" "$EMAILADDR"
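
The two lookups at the top of the script (the user's GUID and the mail store path) can be tried in isolation before the script is allowed to delete anything. The sketch below wraps them in two helper functions whose names, is_guid and default_partition, are invented here. The sample partition map line and the GUID are made-up values, and the GUID pattern is deliberately stricter than the one in the script: it insists on the usual 8-4-4-4-12 layout of uppercase hex digits.

```shell
#!/bin/sh
# Sketch only: helper names and sample values are assumptions.

# Succeed if $1 looks like an uppercase GUID (8-4-4-4-12 hex digits).
# expr prints the number of characters matched (0 if there is no match);
# the leading X keeps expr from mistaking odd input for an operator.
is_guid() {
    n=$(expr "X$1" : 'X[0-9A-F]\{8\}-[0-9A-F]\{4\}-[0-9A-F]\{4\}-[0-9A-F]\{4\}-[0-9A-F]\{12\}$') || n=0
    [ "$n" -gt 0 ]
}

# Print whatever follows "default:" in the given partition map file.
default_partition() {
    sed -n 's,^default:,,p' "$1"
}

# Try both helpers on sample data instead of the live server files
MAP=${TMPDIR:-/tmp}/partition_map.$$
printf 'default:/var/spool/imap/dovecot/mail\n' > "$MAP"
STORE=$(default_partition "$MAP")
echo "mail store: $STORE"
is_guid "0F1E2D3C-4B5A-6978-8796-A5B4C3D2E1F0" && echo "GUID looks valid"
is_guid "unknown user" || echo "not a GUID"
rm -f "$MAP"
```

On the server itself you would feed is_guid the output of cvt_mail_data and point default_partition at /etc/dovecot/partition_map.conf, as the script does.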

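The count-and-delete step can also be rehearsed on a scratch Maildir first. This sketch counts with find rather than ls, strips the blanks that wc adds on Mac OS X, and, unlike the script above, looks at new/ as well as cur/. The helper names count_mails and purge_mails and the scratch directory are invented for the example, and no sa-learn call is made.

```shell
#!/bin/sh
# Sketch only: helper names and the scratch directory are assumptions.

# Print the number of message files in cur/ and new/ of a Maildir folder.
count_mails() {
    find "$1/cur" "$1/new" -type f | wc -l | tr -d ' '
}

# Delete all message files in cur/ and new/ of a Maildir folder.
purge_mails() {
    find "$1/cur" "$1/new" -type f -exec rm -f {} +
}

# Build a throwaway folder with one read and one unread dummy message
DEMO=${TMPDIR:-/tmp}/sa-purge-demo.$$
mkdir -p "$DEMO/cur" "$DEMO/new"
: > "$DEMO/cur/msg1"
: > "$DEMO/new/msg2"

BEFORE=$(count_mails "$DEMO")
purge_mails "$DEMO"
AFTER=$(count_mails "$DEMO")
echo "before: $BEFORE, after: $AFTER"
rm -rf "$DEMO"
```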

Save this script as /usr/local/bin/train-spamassassin-runner and make it executable (chmod 755):
#!/bin/sh
# Run train-spamassassin and log the outcome to syslog. Needed for launchd.
/usr/local/bin/train-spamassassin
stat=$?
if [ $stat -eq 0 ]; then
    # everything is ok
    syslog -s -l NOTICE "$0: train-spamassassin signaled success"
    exit 0
else
    syslog -s -l NOTICE "$0: train-spamassassin signaled error $stat"
    exit $stat
fi


And of course you need a launchd plist file to run the script regularly.
Save this content to /Library/LaunchDaemons/org.roosbertl.spamassassinlearn.plist:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<!-- Start the SpamAssassin sa-learn script when the job is loaded
and every day at 6:00, 12:00 and 18:00.
24/03/10 Atomic.
-->
<dict>
<key>Label</key>
<string>org.roosbertl.spamassassinlearn</string>
<key>UserName</key>
<string>root</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/train-spamassassin-runner</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>StartCalendarInterval</key>
<array>
<dict>
<key>Hour</key>
<integer>6</integer>
<key>Minute</key>
<integer>0</integer>
</dict>
<dict>
<key>Hour</key>
<integer>12</integer>
<key>Minute</key>
<integer>0</integer>
</dict>
<dict>
<key>Hour</key>
<integer>18</integer>
<key>Minute</key>
<integer>0</integer>
</dict>
</array>
</dict>
</plist>

Then load the job so that launchd starts running the script:

sudo launchctl load /Library/LaunchDaemons/org.roosbertl.spamassassinlearn.plist

Now your users just have to move spam/ham emails into the corresponding shared IMAP folders.