bogofilter

BOGOFILTER(1)                                                    BOGOFILTER(1)



NAME
       bogofilter - fast Bayesian spam filter

SYNOPSIS
       bogofilter [help options | classification options | registration
                  options] [algorithm options] [general options]


       where


       help options are:

        [-h] [-V] [-Q]


       classification options are:

        [-p] [-e] [-t] [-u] [-2] [-3] [-M] [-b] [-B filename ...] [-F] [-R]
        [algorithm options] [general options] [parameter options]


       registration options are:

        [-s | -n] [-S | -N] [algorithm options] [general options]


       general options are:

        [-c filename] [-C] [-d dir] [-l] [-L tag] [-I filename] [-O filename]



       algorithm options are:

        [-g | -r | -f]


       parsing options are:

        [-Pi/-PI] [-Ph/-PH] [-Pt/-PT]


       parameter options are:

        [-m value,value] [-o value,value]


       info options are:

        [-q] [-v] [-y date] [-D] [-x flags]


DESCRIPTION
       Bogofilter is a Bayesian spam filter. In its normal mode of operation,
       it takes an email message or other text on standard input, does a
       statistical check against lists of "good" and "bad" words, and returns
       a status code indicating whether or not the message is spam. Bogofilter
       is designed with fast algorithms, uses the Berkeley DB for fast startup
       and lookups, is coded directly in C, and is tuned for speed, so it can
       be used for production by sites that process a lot of mail.


THEORY OF OPERATION
       Bogofilter treats its input as a bag of tokens. Each token is checked
       against "good" and "bad" wordlists, which maintain counts of the
       numbers of times it has occurred in non-spam and spam mails. These
       numbers are used to compute the probability that a mail in which the
       token occurs is spam. After probabilities for all input tokens have
       been computed, a fixed number of the probabilities that deviate
       furthest from average are combined using Bayes's theorem on
       conditional probabilities. If the computed probability that the input
       is spam exceeds a cutoff determined at compile time (currently 0.95,
       for the Robinson-Fisher algorithm), bogofilter returns 0, otherwise 1.
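
       The counting and combining steps described above can be sketched as
       follows. This is a minimal illustration, not bogofilter's actual code:
       the function names, the clamping to [0.01, 0.99], the neutral value for
       unseen tokens, and keep=15 (the figure used in Graham's paper) are all
       assumptions made for the example.

```python
def token_probability(good, bad, ngood, nbad):
    """Estimate P(spam | token) from per-list counts.

    good/bad are the token's counts in the non-spam and spam wordlists;
    ngood/nbad are the numbers of messages registered in each list.
    Clamping to [0.01, 0.99] is an assumption made here so that a single
    token can never force the combined result to exactly 0 or 1.
    """
    g = good / ngood if ngood else 0.0
    b = bad / nbad if nbad else 0.0
    if g + b == 0.0:
        return 0.5  # unseen token: treated as neutral (assumption)
    return min(0.99, max(0.01, b / (g + b)))


def combine(probs, keep=15):
    """Combine the `keep` probabilities that deviate furthest from 0.5.

    Expects probabilities strictly between 0 and 1, such as those
    returned by token_probability().
    """
    extreme = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = 1.0
    ham = 1.0
    for p in extreme:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)
```

       A message would then be called spam when combine() exceeds the cutoff.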


       While this method sounds crude compared to the more usual pattern-
       matching approach, it turns out to be extremely effective. Paul
       Graham's paper A Plan For Spam: http://www.paulgraham.com/spam.html is
       recommended reading.


       This program substantially improves on Paul's proposal by doing smarter
       lexical analysis. In particular, hostnames and IP addresses are
       retained as recognition features rather than broken up. Various kinds
       of MTA cruft such as dates and message-IDs are discarded so as not to
       bloat the word lists. Lex's Swiss-army-knife nature rises again.


       Another seeming improvement is that this program offers Gary Robinson's
       suggested modifications (S and f(w) but not g(w)) to the calculations.
       These modifications are described in Robinson's paper Spam Detection:
       http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html.


       Since then, Robinson and others have realized that the S calculation
       can be further optimized: if a vector of length n contains random,
       uniformly-distributed probabilities p, then -2 * sum(ln(p)) is
       distributed as chi-squared with 2n degrees of freedom. This is believed
       to be the most sensitive test of the hypothesis that the vector of
       probabilities is, in fact, uniformly distributed. Bogofilter now offers
       the option of applying this test (known as Fisher's method) to yield
       P(spam) and P(not spam), and using the difference as the "spamicity"
       score.
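
       As a rough illustration of Fisher's method, the sketch below computes
       the chi-squared tail probability in closed form (valid for an even
       number of degrees of freedom) and combines the two results into a
       single score. The function names and the rescaling of the difference
       into the range 0..1 are assumptions for the example; bogofilter's own
       implementation may differ in detail.

```python
import math


def chi2_sf_even(x, dof):
    """P(X > x) for a chi-squared variable with an even number of dof.

    For even dof the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{dof/2 - 1} (x/2)^i / i!.
    """
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_spamicity(probs):
    """Combine per-token spam probabilities, each strictly in (0, 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way (assumption)
    # Large when many tokens look spammy (p close to 1).
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    # Large when many tokens look hammy (p close to 0).
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    p_spam = 1.0 - chi2_sf_even(spam_stat, 2 * n)
    p_ham = 1.0 - chi2_sf_even(ham_stat, 2 * n)
    # Rescale the difference from [-1, 1] to [0, 1] (assumption).
    return (1.0 + p_spam - p_ham) / 2.0
```

       With this sketch, a vector of uniformly spammy probabilities scores
       close to 1, a uniformly hammy one close to 0, and a vector of 0.5
       values scores exactly 0.5.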


       The input may be one message or many. Messages are broken up on
       "From " lines. The algorithm is relatively insensitive to message
       miscounts.
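
       The splitting rule can be pictured with the short sketch below. It is
       a simplification written for this page (the function name is
       hypothetical) and ignores mbox refinements such as ">From " quoting.

```python
def split_messages(text):
    """Split text into messages at lines that begin with "From "."""
    messages = []
    current = []
    for line in text.splitlines(keepends=True):
        # A "From " line starts a new message unless nothing precedes it.
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages
```

       Input with no "From " line at all comes back as a single message,
       which matches the special case noted under BUGS.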


OPTIONS
       Without command-line options, bogofilter returns 1 if the message is
       non-spam, 0 if it is spam. The non-spam wordfile is created if absent.


       HELP OPTIONS


       The -h option prints the help message and exits.


       The -V option prints the version number and exits.


       The -Q (query) option prints bogofilter's configuration, i.e.
       registration parameters, parsing options, bogofilter directory, etc.


       CLASSIFICATION OPTIONS


       The -p (passthrough) option writes a copy of the input mail to the
       output with an X-Bogosity header (in the style of SpamAssassin)
       inserted. The header will begin with "Yes" or "No" according to whether
       the mail is judged to be spam or non-spam. Note: memory consumption
       depends on whether the input file is a regular file that allows seek
       operations. If it is, the file will be rewound and read a second time,
       without using much memory. If the input is not a regular file (for
       example, a pipeline or socket), bogofilter will cache a copy of the
       entire mail in memory.
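
       A program that receives passthrough output can read the verdict back
       from the inserted header. The sketch below uses Python's standard
       email package; the function name is hypothetical, and it relies only
       on the header beginning with "Yes" or "No" as described above.

```python
from email import message_from_string


def bogosity_verdict(raw_message):
    """Return True for spam, False for non-spam, None if undetermined."""
    msg = message_from_string(raw_message)
    header = msg.get("X-Bogosity")
    if header is None:
        return None  # message did not pass through bogofilter -p
    value = header.strip()
    if value.startswith("Yes"):
        return True
    if value.startswith("No"):
        return False
    return None  # some other value, for example an "unsure" result
```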


       The -e (embed) option tells bogofilter to exit with code 0 even if the
       mail is not spam. This simplifies using bogofilter from procmail or
       maildrop.


       The -t (terse) option tells bogofilter to print an abbreviated
       spamicity message containing 1 letter and the score. The letter will be
       "Y" to indicate spam and "N" to indicate non-spam.


       The -u option tells bogofilter to register the message's text after
       classifying it as spam or non-spam. A spam message will be registered
       on the spamlist and a non-spam message on the goodlist. If using the
       Robinson-Fisher method and the classification is "unsure", the message
       will not be registered. Effectively this option runs bogofilter with
       the -s or -n flag, as appropriate. (Caution is urged in the use of this
       capability, as any classification errors bogofilter may make will be
       preserved and accumulated until corrected with the -Sn and -Ns option
       combinations.)


       The -2 option tells bogofilter to binary classify the message as either
       ham or spam, and never as unsure. When this option is used with -u, a
       wordlist is always updated.


       The -3 option tells bogofilter to use tristate classification for the
       message, i.e. classify the message as ham, spam, or unsure. This option
       is effective only if ham_cutoff is non-zero.
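
       The difference between the two modes can be sketched as a pair of
       threshold tests. The function name and the use of inclusive
       comparisons are assumptions for illustration; the cutoff values in the
       usage note are examples only.

```python
def classify(spamicity, spam_cutoff, ham_cutoff=0.0):
    """Return "spam", "ham" or "unsure" for a spamicity score."""
    if spamicity >= spam_cutoff:
        return "spam"
    if ham_cutoff == 0.0:
        # A zero ham_cutoff means binary classification: never "unsure".
        return "ham"
    if spamicity <= ham_cutoff:
        return "ham"
    return "unsure"
```

       For example, classify(0.5, 0.95, 0.10) gives "unsure", while
       classify(0.5, 0.95) gives "ham".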


       The -M option tells bogofilter to process its input as an mbox-formatted
       file. If the -v or -t option is also given, a spamicity line will be
       printed for each message.


       The -b (streaming bulk mode) option tells bogofilter to classify
       multiple messages whose names are read from stdin. If the -v or -t
       option is also given, bogofilter will print a line giving file name and
       classification information for each file.


       The -B filename ... (bulk mode) option tells bogofilter to classify
       messages named as files on the command line. If the -v or -t option is
       also given, bogofilter will print a line giving file name and
       classification information for each file.


       The -F (force) option ignores threshold values when printing spamicity
       statistics.


       The -R option tells bogofilter to output an R data frame in text form
       on the standard output. See the section on integration with R, below,
       for further detail.


       REGISTRATION OPTIONS


       The -s option tells bogofilter to register the text presented on
       standard input as spam. The spam wordfile is created if absent.


       The -n option tells bogofilter to register the text presented on
       standard input as non-spam.


       Bogofilter doesn't detect if a message is registered twice. If you do
       this by accident, the token counts will be off by 1 from what you really
       want and the corresponding spam scores will be slightly off. Given a
       large number of tokens and messages in the wordlists, this doesn't
       matter. The problem _can_ be corrected by using the -S option or the -N
       option.


       The -S option tells bogofilter to undo a prior registration of the same
       message as spam. If a message was incorrectly entered in the spam
       wordfile by '-s' or '-u' and you want to remove it from the spam
       wordfile and enter it in the non-spam wordfile, use options '-Sn'. If
       '-S' is used for a message that wasn't registered as spam, the counts
       will still be decremented.


       The -N option tells bogofilter to undo a prior registration of the same
       message as non-spam. If a message was incorrectly entered in the non-
       spam wordfile by '-n' or '-u' and you want to remove it from the non-
       spam wordfile and enter it in the spam wordfile, then use '-Ns'. If
       '-N' is used for a message that wasn't registered as non-spam, the
       counts will still be decremented.
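
       The bookkeeping behind the four registration options can be modelled
       with two counters, as in the sketch below. The class and method names
       are hypothetical, and the model counts every token occurrence; how
       bogofilter stores its counts on disk is not described here.

```python
from collections import Counter


class Wordlists:
    """Toy model of the spam and non-spam wordlists."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()

    def register(self, tokens, as_spam):
        """Model of -s (as_spam=True) and -n (as_spam=False)."""
        target = self.spam if as_spam else self.good
        target.update(tokens)

    def unregister(self, tokens, as_spam):
        """Model of -S (as_spam=True) and -N (as_spam=False).

        Counts are decremented whether or not the message was ever
        registered, mirroring the behaviour described above.
        """
        target = self.spam if as_spam else self.good
        target.subtract(tokens)
```

       In this model, the '-Sn' combination is unregister(tokens, True)
       followed by register(tokens, False), and '-Ns' is the reverse.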


       GENERAL OPTIONS


       The -c filename option tells bogofilter to read the config file named.


       The -C option prevents bogofilter from reading configuration files.


       The -d dir option allows you to set the directory under which wordlists
       will be found to dir. If omitted, the default directory will be
       $BOGOFILTER_DIR if BOGOFILTER_DIR is set and $HOME/.bogofilter
       otherwise.


       The -l option writes an informational line to the system log each time
       bogofilter is run. The information logged depends on how bogofilter is
       run.


       The -L tag option configures a tag which can be included in the
       information being logged by the -l option, but it requires a custom
       format that includes the %l string for now. This option implies -l.


       The -I filename option tells bogofilter to read its input from the
       specified file, rather than from stdin.


       The -O filename option tells bogofilter where to write its output in
       passthrough mode. Note that this only works when -p is explicitly
       given.


       ALGORITHM OPTIONS


       The Robinson-Fisher method is the default algorithm used for computing
       a message's spamicity score, unless bogofilter has been compiled
       without it, by using the --disable-robinson-fisher option to the
       configure script. The method to be used can be specified on the command
       line or in the configuration file.


       The -g option selects the original Graham form of the calculation
       method.


       The -r option selects the Robinson modifications to the calculation
       method.


       The -f option selects the Robinson-Fisher modifications to the
       calculation method.


       The configure script has options --disable-graham-method,
       --disable-robinson-method, and --disable-robinson-fisher so that
       bogofilter can be built to support a subset of the available methods.


       PARSING OPTIONS


       Bogofilter has three special parsing options which can be enabled (or
       disabled) at the user's discretion. The options are of the form -Px and
       -PX, where x designates an option letter. For the parsing options, a
       lower case letter enables the option and an upper case letter disables
       it.


       Options -Ph and -PH are for header line markup, i.e. whether to create
       special tags for header lines. When enabled, tokens in "To:", "From:",
       "Return-Path:", and "Subject:" lines will be given special prefixes.
       Enabling this option increases bogofilter's accuracy.


       Options -Pi and -PI are for ignoring case, i.e. whether to map upper
       case to lower case (or not). Disabling this option increases
       bogofilter's accuracy.


       Options -Pt and -PT are for tokenizing the innards of three HTML tags,
       i.e. <a>, <img>, and <font>. Tokenizing these tags adds URLs and font
       names to the message's tokens. Enabling this option increases
       bogofilter's accuracy.


       PARAMETER OPTIONS


       The -m value,value option allows setting the min_dev value and,
       optionally, the robs value. If one value is supplied, then min_dev is
       set. If a comma followed by one value is supplied, then robs is set.
       With two values, both min_dev and robs are set. Note that the syntax is
       misleading: at least one of the values MUST be present, and the comma
       determines whether it sets min_dev or robs. Note: spaces are not
       allowed after the comma.


       The -o value,value option allows setting the spam_cutoff value and,
       optionally, the ham_cutoff value. If one value is supplied, then
       spam_cutoff is set. If a comma followed by one value is supplied, then
       ham_cutoff is set. With two values, both spam_cutoff and ham_cutoff are
       set. Note that the syntax is misleading: at least one of the values
       MUST be present, and the comma determines whether it sets the spam or
       the ham cutoff. Note: spaces are not allowed after the comma.
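
       The value,value convention shared by -m and -o can be summarised by
       the sketch below, which returns None for a value that was not given.
       The function name is hypothetical and the parsing is a model of the
       rules stated above, not bogofilter's own argument handling.

```python
def parse_pair(arg):
    """Parse "a", ",b" or "a,b" into a (first, second) tuple of floats."""
    if any(ch.isspace() for ch in arg):
        raise ValueError("spaces are not allowed in the argument")
    first, _, second = arg.partition(",")
    if not first and not second:
        raise ValueError("at least one value must be present")
    return (float(first) if first else None,
            float(second) if second else None)
```

       For -o, the first element corresponds to spam_cutoff and the second to
       ham_cutoff; for -m, they correspond to min_dev and robs.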


       INFO OPTIONS


       The -q (quiet) option suppresses warning messages.


       The -v option produces a report to standard output on bogofilter's
       analysis of the input. Each additional v will increase the verbosity of
       the output, up to a maximum of 4. With -vv, the report lists the tokens
       with highest deviation from a mean of 0.5 association with spam.


       The -y date option specifies the date to give to tokens that don't have
       dates.


       The -D option redirects debug output to stdout.


       The -x flags option allows setting of debug flags for printing debug
       information.


ENVIRONMENT
       Bogofilter will initialize its database directory to $BOGOFILTER_DIR if
       BOGOFILTER_DIR is set. If it is not set, bogofilter will use
       $HOME/.bogofilter instead. If neither BOGOFILTER_DIR nor HOME is set,
       the -d dir option must be present.
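
       The lookup order for the wordlist directory can be written out as the
       sketch below. The function name and the environ parameter are
       hypothetical conveniences for the example.

```python
import os


def wordlist_dir(d_option=None, environ=None):
    """Return the wordlist directory using the precedence described above."""
    env = os.environ if environ is None else environ
    if d_option:
        return d_option                      # -d dir wins when given
    if "BOGOFILTER_DIR" in env:
        return env["BOGOFILTER_DIR"]
    if "HOME" in env:
        return env["HOME"] + "/.bogofilter"
    raise ValueError("neither BOGOFILTER_DIR nor HOME is set; -d is required")
```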


CONFIGURATION
       The bogofilter command line allows setting of many options that
       determine how bogofilter operates. File /usr/local/etc/bogofilter.cf
       can be used to set additional parameters that affect its operation.
       File /usr/local/etc/bogofilter.cf.example has samples of all of the
       parameters. Status and logging messages can be customized for each site
       (see /usr/local/etc/bogofilter.cf.example).


RETURN VALUES
       0 for spam; 1 for non-spam; 2 for I/O or other errors.


       If both -p and -e are used, the return values are: 0 for spam or non-
       spam; 2 for I/O or other errors.


       Error 2 usually means that the wordlist files bogofilter wants to read
       at startup are missing or the hard disk has filled up in -p mode.
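
       A calling program might interpret these return values as in the
       sketch below. The function names and the label strings are
       hypothetical; the only bogofilter features relied upon are the exit
       codes listed above and the -I filename option.

```python
import subprocess


def interpret_exit_status(code, embed=False):
    """Map a bogofilter exit status to a label, per the values above."""
    if code == 0:
        # With -e, 0 no longer distinguishes spam from non-spam.
        return "processed" if embed else "spam"
    if code == 1 and not embed:
        return "non-spam"
    # 2 is the documented error code; anything unexpected is treated alike.
    return "error"


def classify_file(path):
    """Run bogofilter on one message file and label the result."""
    result = subprocess.run(["bogofilter", "-I", path])
    return interpret_exit_status(result.returncode)
```

       Note that 0 means spam here, the reverse of the usual shell convention
       in which 0 signals "nothing wrong".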


INTEGRATION WITH OTHER TOOLS
       Use with Procmail


       The following procmail rule will take mail on stdin and direct it to
       Mail/spam if bogofilter thinks it's spam:

       :0HB:
       * ? bogofilter
       Mail/spam

       and this similar rule will also register the tokens in the mail
       according to the bogofilter classification:

       :0HB:
       * ? bogofilter -u
       Mail/spam




       If bogofilter fails (returning 2) the message will be treated as non-
       spam.


       The following recipe (a) spam-bins anything that bogofilter rates as
       spam, (b) adds the words in messages rated as spam to the spam
       wordlist, and (c) adds the words in messages rated as non-spam to the
       non-spam wordlist. With this in place, it will normally only be
       necessary for the user to intervene (with -Ns or -Sn) when bogofilter
       miscategorizes something.



       # filter mail through bogofilter, tagging it as spam and
       # updating the word lists

       :0fw
       | bogofilter -u -e -p


       # if bogofilter failed, return the mail to the queue, the MTA will
       # retry to deliver it later
       # 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h

       :0e
       { EXITCODE=75 HOST }


       # file the mail to spam-bogofilter if it's spam.

       :0:
       * ^X-Bogosity: Yes, tests=bogofilter
       spam-bogofilter




       This one is for maildrop. It automatically defers the mail and retries
       later when the xfilter command fails. Use this in your ~/.mailfilter:

       xfilter "bogofilter -u -e -p"
       if (/^X-Bogosity: Yes, tests=bogofilter/)
       {
         to "spam-bogofilter"
       }



       The following .muttrc lines will create mutt macros for dispatching
       mail to bogofilter.

       macro index d "<enter-command>unset wait_key\n\
       <pipe-entry>bogofilter -n\n\
       <enter-command>set wait_key\n\
       <delete-message>" "delete message as non-spam"
       macro index \ed "<enter-command>unset wait_key\n\
       <pipe-entry>bogofilter -s\n\
       <enter-command>set wait_key\n\
       <delete-message>" "delete message as spam"



       Integration with Mail Transport Agent (MTA)


       bogofilter can also be integrated into an MTA to filter all incoming
       mail. While the specific implementation is MTA dependent, the general
       steps are as follows:

       1. Install bogofilter on the mail server.

       2. Prime the bogofilter databases with a spam and non-spam corpus.
          Since bogofilter will be serving a larger community, it is important
          to prime it with a representative set of messages.

       3. Set up the MTA to invoke bogofilter on each message. While this is
          an MTA specific step, you'll probably need to use the -p, -u, and -e
          options.

       4. Set up a mechanism for users to register spam/nonspam messages, as
          well as to correct mis-classifications. The most generic solution is
          to set up alias email addresses to which users bounce messages.

       See the doc and contrib directories for more information.

       Use of R to verify Bogofilter calculations


       The -R option tells bogofilter to generate an R data frame. The data
       frame contains one row per token analysed. Each such row contains the
       token, the sum of its database "good" and "spam" counts, the "good"
       count divided by the number of non-spam messages used to create the
       training database, the "spam" count divided by the spam message count,
       Robinson's f(w) for the token, the natural logs of (1 - f(w)) and f(w),
       and an indicator character (+ if the token's f(w) value exceeded the
       minimum deviation from 0.5, - if it didn't). There is one additional
       row at the end of the table that contains a label in the token field,
       followed by the number of words actually used (the ones with +
       indicators), Robinson's P, Q, S, s and x values and the minimum
       deviation.
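
       To make the per-token columns concrete, the sketch below builds one
       such row in Python. It follows the f(w) formula from Robinson's paper,
       f(w) = (s*x + n*p(w)) / (s + n), and is not taken from bogofilter's
       source; the function name and the parameter values in the usage note
       are assumptions for illustration. It requires s > 0 and 0 < x < 1.

```python
import math


def r_row(token, good, bad, ngood, nbad, s, x, min_dev):
    """Build one per-token row with the columns described above."""
    n = good + bad                 # sum of database "good" and "spam" counts
    good_ratio = good / ngood      # "good" count / non-spam message count
    spam_ratio = bad / nbad        # "spam" count / spam message count
    # Raw estimate p(w); falls back to x for a token never seen before.
    p = spam_ratio / (good_ratio + spam_ratio) if n else x
    fw = (s * x + n * p) / (s + n)
    indicator = "+" if abs(fw - 0.5) > min_dev else "-"
    return (token, n, good_ratio, spam_ratio, fw,
            math.log(1.0 - fw), math.log(fw), indicator)
```

       For instance, r_row("pills", 0, 10, 100, 100, s=1.0, x=0.5,
       min_dev=0.1) yields f(w) = 10.5/11 and a "+" indicator.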


       The R data frame can be saved to a file and later read into an R
       session (see the R project website: http://cran.r-project.org for
       information about the mathematics package R). Provided with the
       bogofilter distribution is a simple R script (file bogo.R) that can be
       used to verify bogofilter's calculations. Instructions for its use are
       included in the script in the form of comments.


LOG MESSAGES
       Bogofilter writes messages to the system log when the -l option is
       used. What is written depends on which other flags are used.


       A classification run will generate (we are not showing the date and
       host part here):


              bogofilter[1412]: X-Bogosity: No, spamicity=0.000227
              bogofilter[1415]: X-Bogosity: Yes, spamicity=0.998918
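
       Classification lines of this shape can be picked out of a log with a
       regular expression, as in the sketch below. The pattern is derived
       only from the sample lines shown here; the function name is
       hypothetical, and customized log formats would need a different
       pattern.

```python
import re

# Matches e.g. "bogofilter[1412]: X-Bogosity: No, spamicity=0.000227"
CLASSIFICATION_RE = re.compile(
    r"bogofilter\[(?P<pid>\d+)\]: "
    r"X-Bogosity: (?P<verdict>Yes|No), "
    r"spamicity=(?P<score>[0-9.]+)"
)


def parse_classification(line):
    """Return (pid, is_spam, spamicity), or None if the line doesn't match."""
    match = CLASSIFICATION_RE.search(line)
    if match is None:
        return None
    return (int(match.group("pid")),
            match.group("verdict") == "Yes",
            float(match.group("score")))
```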




       Using '-u' to classify a message and update a wordlist will produce
       (on a single line):


              bogofilter[1426]: X-Bogosity: Yes, spamicity=0.998918,
                register -s, 329 words, 1 messages




       Registering words ('-l' and '-s', '-n', '-S', or '-N') will produce:


              bogofilter[1440]: register-n, 255 words, 1 messages




       A registration run (using '-s', '-n', '-N', or '-S') will generate
       messages like:


              bogofilter[17330]: register-n, 574 words, 3 messages
              bogofilter[6244]: register-s, 1273 words, 4 messages




FILES
       /usr/local/etc/bogofilter.cf
              System configuration file.


       ~/.bogofilter.cf
              User configuration file.


       ~/.bogofilter/goodlist.db
              List of good tokens.


       ~/.bogofilter/spamlist.db
              List of spam tokens.


BUGS
       bogofilter counts messages on input by looking for "From " lines. As a
       special case, a single message without a "From " line is counted
       correctly. Multiple messages without intervening "From " lines will be
       counted as one message.


       Bogofilter does not canonicalize the transport encoding or character
       set, sacrificing precision. We used to believe that spam with
       enclosures invariably gives itself away through cues in the headers and
       non-enclosure parts, but this is not true. This will be fixed in a
       future version.


AUTHOR
       Eric S. Raymond <esr@thyrsus.com>.


       For updates, see the bogofilter project page:
       http://bogofilter.sourceforge.net/.




                                                                 BOGOFILTER(1)