SPAMPROBE(1)              BSD General Commands Manual             SPAMPROBE(1)

NAME
     spamprobe — Spam detector using Bayesian analysis of word counts.

SYNOPSIS
     spamprobe [-a char] [-c] [-d directory] [-h] [-H option] [-m] [-n number]
               [-r number] [-s number] [-v] [-V] [-Y] [-7] [-8] command [...]
     spamprobe receive [filename ...]
     spamprobe score [filename ...]
     spamprobe find-spam [filename ...]
     spamprobe find-good [filename ...]
     spamprobe good [filename ...]
     spamprobe spam [filename ...]
     spamprobe remove [filename ...]
     spamprobe dump
     spamprobe export
     spamprobe import [filename ...]

DESCRIPTION
     Welcome to SpamProbe!  Are you tired of the constant bombardment of your
     inbox by unwanted email pushing everything from porn to get-rich-quick
     schemes?  Have you tried other spam filters but become disenchanted with
     them when you realized that their manually generated rule sets weren't
     updated fast enough to keep up with spammers' wording changes?  Or that
     they generated unwanted false positive scores?

     SpamProbe operates on a different basis entirely.  Instead of using
     pattern matching and a set of human-generated rules, SpamProbe relies on
     a Bayesian analysis of the frequency of words used in spam and non-spam
     emails received by an individual person.  The process is completely
     automatic and tailors itself to the kinds of emails that each person
     receives.

   FEATURES
           •   Spam detection using Bayesian analysis of terms contained in
               each email.  Words used often in spams but not in good email
               tend to indicate that a message is spam.
           •   Written in C++ for good performance.  Database access using
               GDBM for quick startup and fast term count retrieval.
           •   Recognition and decoding of MIME attachments in quoted-
               printable and base64 encoding.  Automatically skips non-text
               attachments.
           •   Counts two-word phrases as well as single words for higher
               precision.
           •   Ignores HTML tags in emails for scoring purposes unless the -h
               command line option is used.  Many spams use HTML and few
               humans do, so HTML tends to become a powerful recognizer of
               spams.  However, in the author's opinion this also
               substantially increases the likelihood of false positives if
               someone does send a non-spam email containing HTML tags.
               SpamProbe does, however, pull URLs from inside HTML tags since
               those tend to be spammer specific.
           •   Locks mboxes and databases using fcntl file locking to avoid
               problems when multiple emails arrive simultaneously.
           •   Scores only the Received, Subject, To, From, and Cc headers.
               All other headers are ignored to make it hard for spammers to
               hide non-spammy words in X- headers to fool the filter.  The -H
               command line option can be used to override this.
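
     The term analysis described above can be sketched in Python.  This is a
     simplified illustration in the spirit of the Paul Graham essay listed
     under SEE ALSO, not SpamProbe's actual code: the function names, the
     tokenizing regular expression, the 0.01/0.99 clamp, and the neutral 0.5
     for unseen terms are all assumptions made for the example.

```python
import re


def terms(text):
    """Return single words plus adjacent two-word phrases, lowercased."""
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def term_probability(good, spam, n_good, n_spam):
    """Estimate how strongly a term indicates spam.

    good/spam are the term's counts in each corpus; n_good/n_spam are the
    number of messages in each corpus.  The result is clamped so that no
    single term is ever treated as absolute proof (assumed bounds).
    """
    g = good / max(n_good, 1)
    s = spam / max(n_spam, 1)
    if g + s == 0:
        return 0.5  # unseen term: treated as neutral (illustrative choice)
    return min(0.99, max(0.01, s / (g + s)))


def combine(probs):
    """Combine per-term probabilities, assuming the terms are independent."""
    spam_like = 1.0
    good_like = 1.0
    for p in probs:
        spam_like *= p
        good_like *= 1.0 - p
    return spam_like / (spam_like + good_like)
```

     A term seen only in spam ends up at the upper clamp, a term seen only in
     good mail at the lower clamp, and a message whose terms pull equally in
     both directions combines to a score near 0.5.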

   OPTIONS
     -a char       By default spamprobe converts non-ASCII characters
                   (characters with the most significant bit set to 1) into
                   the letter 'z'.  This is useful for lumping all Asian
                   characters into a single word for easy recognition.  The -a
                   option allows you to change the character to something else
                   if you don't like the letter 'z' for some reason.

     -c            Create the database directory if it does not already exist.
                   Normally spamprobe exits with a usage error if the database
                   directory does not already exist.

     -d directory  By default spamprobe stores its database in a directory
                   named .spamprobe under your home directory.  The -d option
                   allows you to specify a different directory to use.  This
                   is necessary, for example, if your home directory is NFS
                   mounted.

     -h            By default spamprobe removes HTML markup from the text in
                   emails to help avoid false positives.  The -h option allows
                   you to override this behavior and force spamprobe to
                   include words from within HTML tags in its word counts.
                   Note that spamprobe always counts any URLs in hrefs within
                   tags whether -h is used or not.  Use of this option is
                   discouraged.  It can increase the rate of spam detection
                   slightly but unless the user receives a significant amount
                   of HTML emails it also tends to increase the number of
                   false positives.

     -H option     By default spamprobe only scans a meaningful subset of
                   headers from the email message when searching for words to
                   score.  The -H option allows the user to select which
                   headers are scanned.  Legal values are "all", "nox", or
                   "normal".  "all" scans all headers, "nox" scans all headers
                   except those starting with X-, and "normal" scans the
                   normal set of headers.

     -m            Use mbox format for reading emails in receive mode.
                   Normally spamprobe assumes that the input to receive mode
                   contains a single message so it doesn't look for message
                   breaks.

     -n number     Changes the number of most significant words/phrases used
                   by spamprobe to calculate the score for each message.
                   Generally this is changed only for optimization purposes.

     -r number     Changes the number of times that a single word/phrase can
                   occur in the top words array used to calculate the score
                   for each message.  Allowing repeats reduces the number of
                   words overall (since a single word occupies more than one
                   slot) but allows words which occur frequently in the
                   message to have a higher weight.  Generally this is changed
                   only for optimization purposes.

     -s number     spamprobe maintains an in-memory cache of the words it has
                   seen in previous messages to reduce disk I/O and improve
                   performance.  By default the cache is flushed and cleared
                   every 250 messages.  This number can be changed using the
                   -s option.  A value of zero causes spamprobe to use
                   100,000 as the limit, which effectively means that the
                   cache will only be flushed at program exit (unless you
                   have really enormous mailbox files).  The cache doesn't
                   affect receive, dump, or export but has a significant
                   impact on the other commands.

     -v            Write debugging information to stderr.  This can be useful
                   for debugging or for seeing which terms spamprobe used to
                   score each email.

     -V            Prints version and copyright information and then exits.

     -Y            Assume traditional Berkeley mailbox format, ignoring any
                   Content-Length: fields.

     -7            Ignore any characters with the most significant bit set to
                   1 instead of mapping them to the letter 'z'.

     -8            Store all characters even if their most significant bit is
                   set to 1.
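
     The interaction between the -n and -r options can be pictured with a
     short Python sketch.  It is a guess at the general shape of the
     selection, not spamprobe's implementation: the function name, the
     "distance from 0.5" ranking, and the parameter names max_terms (for -n)
     and max_repeats (for -r) are hypothetical.

```python
from collections import Counter


def top_terms(message_terms, prob, max_terms, max_repeats):
    """Fill the array of terms used to score one message.

    message_terms: every term found in the message, repeats included.
    prob:          mapping of term -> spam probability (0.0 to 1.0).
    max_terms:     total number of slots (the role played by -n).
    max_repeats:   slots a single term may occupy (the role played by -r).
    """
    counts = Counter(message_terms)
    # Rank terms by how far their probability is from the neutral 0.5.
    ranked = sorted(counts,
                    key=lambda t: abs(prob.get(t, 0.5) - 0.5),
                    reverse=True)
    slots = []
    for term in ranked:
        # A frequent term may take several slots, up to max_repeats.
        for _ in range(min(counts[term], max_repeats)):
            if len(slots) == max_terms:
                return slots
            slots.append(term)
    return slots
```

     With three slots and two repeats allowed, a term that occurs three times
     takes two slots and leaves one for the next most significant term; with
     repeats limited to one, three distinct terms are used instead.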

   COMMANDS
     receive [filename ...]    Tells spamprobe to read its standard input (or
                               a file specified after the receive command) and
                               score it using the current databases.  Once the
                               message has been scored the message is
                               classified as either spam or non-spam and its
                               word counts are written to the appropriate
                               database.  The message's score is written to
                               stdout along with a single word.  For example:

                                     SPAM 0.99

                               or

                                     GOOD 0.02

     score [filename ...]      Similar to receive except that the databases
                               are not modified in any way and only the score
                               is printed to stdout.

     find-spam [filename ...]  Similar to score except that it prints a short
                               summary and score for each message that is
                               determined to be spam.  This can be useful when
                               testing.

     find-good [filename ...]  Similar to score except that it prints a short
                               summary and score for each message that is
                               determined to be good.  This can be useful when
                               testing.

     good [filename ...]       Scans each file (or stdin if no file is
                               specified) and reclassifies every email in the
                               file as non-spam.  The databases are updated
                               appropriately.  Previously processed messages
                               (recognized using their message ids) are
                               ignored.

     spam [filename ...]       Scans each file (or stdin if no file is
                               specified) and reclassifies every email in the
                               file as spam.  The databases are updated
                               appropriately.  Previously processed messages
                               (recognized using their message ids) are
                               ignored.

     remove [filename ...]     Scans each file (or stdin if no file is
                               specified) and removes its term counts from the
                               database.  Messages which are not in the
                               database (recognized using their message ids)
                               are ignored.

     dump                      Prints the contents of the word counts database
                               one word per line in human-readable format with
                               good count, spam count, and word in columns
                               separated by whitespace.  Note that when using
                               GDBM for the database the words are printed in
                               the order they are hashed so the results will
                               need to be sorted to be most useful.  The
                               standard Unix sort command can do this.  For
                               example, to list all words from "most good" to
                               "least good" use this command:

                                     spamprobe dump | sort -k 1 -n -r

                               To list all words from "most spammy" to "least
                               spammy" use this command:

                                     spamprobe dump | sort -k 2 -n -r

     export                    Similar to the dump command but prints the
                               counts and words in a comma-separated format
                               with the words surrounded by double quotes.
                               This can be more useful for importing into some
                               databases.

     import [filename ...]     Reads the specified files which must contain
                               export data written by the export command.  The
                               terms and counts from these files are added to
                               the database.  This can be used to convert a
                               database from a prior version.
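
     Scripts that consume spamprobe output can parse it with a few lines of
     Python.  The sketch below relies only on the formats described above
     (the "SPAM 0.99" verdict line from receive, and the good count, spam
     count, word columns from dump); the function names are hypothetical, and
     the assumption that a two-word phrase may contain a space in dump output
     is a precaution rather than documented behavior.

```python
def parse_verdict(line):
    """Split a receive verdict such as 'SPAM 0.99' into (word, score)."""
    word, score = line.split()
    return word, float(score)


def parse_dump(lines):
    """Turn dump output lines into (good_count, spam_count, term) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first two runs of whitespace only, so a term that
        # itself contains a space stays in one piece.
        good, spam, term = line.split(None, 2)
        rows.append((int(good), int(spam), term))
    return rows


def most_spammy(rows):
    """Order rows by spam count, highest first (like sort -k 2 -n -r)."""
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

     For example, the list returned by parse_dump for the lines of a saved
     dump file can be passed straight to most_spammy to reproduce the second
     sort command shown under dump.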

ENVIRONMENT
     The spamprobe command looks for the database directory in the user's home
     directory specified by the HOME environment variable.  Use the -d flag to
     specify a different database directory.

FILES
     $HOME/.spamprobe  The default database directory.

EXAMPLES
     Typically one would use spamprobe with procmail and formail to flag and
     filter incoming email.

           # SpamProbe rule.
           :0
           {
               # Generate a score for the message.
               SCORE=`spamprobe receive`
               # Add an X-SpamProbe header to the message.
               :0 fhW
               | formail -I "X-SpamProbe: $SCORE"
           }

           # Filter matching messages to their own mailbox.
           :0:
           *^X-SpamProbe: SPAM
           spamprobe
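
     Where procmail and formail are not available, the tagging step of the
     recipe above could be approximated in Python.  This is a sketch under
     assumptions, not a tested replacement: the function names are
     hypothetical, and it assumes spamprobe is on the PATH and prints a
     single verdict line as described for the receive command.

```python
import subprocess
from email import message_from_bytes


def run_spamprobe(raw):
    """Pipe one raw message to `spamprobe receive`; return its verdict line."""
    result = subprocess.run(["spamprobe", "receive"], input=raw,
                            capture_output=True, check=True)
    return result.stdout.decode("ascii", "replace").strip()


def tag_message(raw, classify=run_spamprobe):
    """Return the message with a fresh X-SpamProbe header.

    classify is any callable taking the raw message bytes and returning a
    verdict string; it defaults to calling spamprobe itself.
    """
    msg = message_from_bytes(raw)
    # Remove any existing (possibly forged) header before adding ours,
    # mirroring what formail -I does.
    del msg["X-SpamProbe"]
    msg["X-SpamProbe"] = classify(raw)
    return msg.as_bytes()
```

     Passing a different classify callable makes it possible to exercise the
     header handling without touching the spamprobe databases.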

DIAGNOSTICS
     Exit status is 0 on success, and 1 if spamprobe encounters an invalid
     command.

COMPATIBILITY
     Versions of spamprobe prior to 0.7 use a different database format.  To
     convert your existing database to the new format use the following
     command:

           spamprobe-export_0.6 | spamprobe import

SEE ALSO
     formail(1), procmail(1)

     Paul Graham, A Plan for Spam, August 2002,
     http://www.paulgraham.com/spam.html.

AUTHORS
     This manual page was written by Matthew N. Dodd <mdodd@FreeBSD.org>.
     spamprobe was written by Brian Burton <bburton@users.sourceforge.net>.

BSD                            September 5, 2002                           BSD