ferret

BIN::FERRET(1)        User Contributed Perl Documentation       BIN::FERRET(1)



NAME
       ferret - "Ferret" search engine command-line interface

SYNOPSIS
       ferret [--index=<file>] <action> [action parms] [...]

       Actions:
           addfile        adds named file(s) to index
           addstoppers    adds stopper (non-indexed) words
           commonwords    report words existing in many files
           query          searches for specified words
           removefile     removes named file(s) from index
           removestoppers removes stopper (non-indexed) words
           setoption      set specified options
           shrink         reduce index size after adds/removes
           unsetoption    unset specified options

DESCRIPTION
       The ferret command provides a convenient interface to update or
       perform maintenance on an index.

       General Options

       The following option is applicable to all actions.

       --index=<file>
           This option specifies which index file is to be used.  If this
           option is omitted, it will default to the value of the environment
           variable FERRET_INDEX or "./ferret.index" if FERRET_INDEX is not
           set.
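
       The lookup order described above can be sketched in shell.  This is
       only an illustration: resolve_index is a hypothetical helper, not
       part of ferret, and it assumes an empty FERRET_INDEX is treated the
       same as an unset one.

```shell
# Hypothetical helper mirroring the documented order:
# explicit --index value, then $FERRET_INDEX, then ./ferret.index.
resolve_index() {
    explicit="$1"
    if [ -n "$explicit" ]; then
        printf '%s\n' "$explicit"
    else
        # Assumption: an empty FERRET_INDEX counts as "not set".
        printf '%s\n' "${FERRET_INDEX:-./ferret.index}"
    fi
}

# An explicit value always wins over the environment.
resolve_index /usr/local/index
```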

AddFile
       Use: ferret [...] addfile [--addprefix=<path>] [--stripprefix=<regex>]
       [--notitles] [--summary=<size>] [--lines=<number>] [--doctype=<type>]
       [--filter='<program & args>'] [--force] <file-to-add> [...]

       Options:
           addprefix    add to each filename before indexing
           doctype      process as documents of this type
           filter       run raw data through this program
           force        add regardless of modification time
           lines        limit summary to this number of lines
           notitles     suppress titles for indexed documents
           stripprefix  strip from each filename before indexing
           summary      limit summary to this number of bytes

       Creating an index of a local repository of files is easily done with
       the "ferret addfile" maintenance command.  A command such as

           find /usr/local/repository -type f -print | xargs -n 500 \
                ferret --index=/usr/local/index addfile

       will create an index of the text of all files under the 'repository'
       directory in /usr/local.  The 'ferret' program is smart enough to
       recognize standard extensions and handle them accordingly.  Compressed
       files are uncompressed before being indexed.  Recognized file types are
       HTML, Code (C/C++, Perl, shell-script), and MIF (Maker Interchange
       Format).  Any unknown types are assumed to be plain text.  This is a
       safe fall-back since any binary codes are stripped out before any
       document is added to the index.

       After calling "ferret addfile" multiple times, as 'xargs' is bound to
       do, it might help to run the command

           ferret --index=/usr/local/index shrink

       to remove any obsolete data and pack the index into as small a file as
       possible.
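
       The add-then-shrink sequence above could be wrapped as follows.  This
       is a sketch, not a supplied script: index_tree and the FERRET variable
       are hypothetical names, and the -print0 and -0 flags assume a find and
       xargs that support them (GNU and BSD versions do).

```shell
# FERRET names the program to run; overriding it allows a dry run.
FERRET="${FERRET:-ferret}"

# Hypothetical wrapper: index every regular file under a directory,
# then shrink the index once at the end.
index_tree() {
    repo="$1"
    index="$2"
    # NUL-separated names keep paths containing spaces intact.
    find "$repo" -type f -print0 |
        xargs -0 -n 500 "$FERRET" --index="$index" addfile &&
    "$FERRET" --index="$index" shrink
}

# Usage: index_tree /usr/local/repository /usr/local/index
```

       Setting FERRET=echo before calling index_tree prints the commands
       that would be run instead of running them.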

       AddFile-specific Parameters

       The following parameters are specific to the "addfile" action.

       --addprefix=<string>
           This option will cause AddFile to prepend a given string to each
           filename after it has been read from disk but before it is stored
           in the index.  This could be used to change the pathnames so they
           will work when seen by the query program.  Note that this happens
           after any prefix has been stripped (see --stripprefix).

           If not specified, nothing will be prepended to the pathnames.

           Example:  --addprefix="/mnt/"

       --doctype=<type>
           This option will cause AddFile to add all listed documents as being
           of the specified type.  This is best used when adding
           difficult-to-recognize types such as code for various languages.
           Valid types are HTML, Code, MIF, and Text.  Unlike many options,
           these type names are case sensitive.

           If not specified, AddFile will try to dynamically determine the
           type for each file on an individual basis based on the filename
           extension and its content.

           Example:  --doctype=Code

       --filter='<program & args>'
           This option will cause AddFile to run each file through the given
           program (with the specified arguments) before indexing the data.
           This program must accept the file on STDIN and write its output to
           STDOUT.

           If not specified, AddFile will dynamically determine what filters
           should be run based on the file's extension.  Currently supported
           extensions are .gz (gzip) and .Z (compress).

           Example:  --filter='gzip -d | pod2text'

       --force
           This option will cause AddFile to add every file on the command
           line even if the last modification date of those files is older
           than when they were last added to the index.

           Example:  --force

       --lines=<number>
           This option will cause AddFile to limit summaries to a maximum of
           this number of lines.  This option will not affect the summaries
           generated for HTML documents because there is no way to count lines
           until the document is actually displayed.

           If not specified, a maximum of 5 lines will be generated.

           Example:  --lines=10

       --notitles
           This option will cause AddFile to suppress storing titles for the
           documents it indexes.  Display/Query programs can use titles to
           better show the user what matches have been found.

           If not specified, titles will be stored.

           Example:  --notitles

       --stripprefix=<regex>
           This option will cause AddFile to remove the specified pattern from
           each filename after it has been read from disk but before it is
           stored in the index.  This could be used to change the pathnames so
           they will work when seen by the query program.  Note that this
           happens before any prefix is added (see --addprefix).  The regex
           is evaluated (case insensitive) by Perl and thus can use all of the
           Perl extensions.

           If not specified, nothing will be stripped from the pathnames.

           Example:  --stripprefix=".*/"

       --summary=<size>
           This option will cause AddFile to limit the size of each summary.
           This is most useful for HTML documents since straight text
           documents are better restricted by number of lines (--lines
           option).

           If not specified, the maximum summary size is 250 bytes.

           Example:  --summary=500

AddStoppers
       Use: ferret [...] addstoppers <word> [...]

       Ferret uses a concept of "stoppers" to reduce the amount of information
       it stores in its index.  These non-content words (e.g. "the", "where",
       "shouldn't", etc.) can safely be removed from most indexes because they
       don't actually mean anything.  A default list is provided by Ferret,
       but can easily be added to on a per-index basis using this function.
       Once added, these words are remembered until the index is deleted or
       they are removed using the removestoppers command.

       Once added, that word and all the data associated with it is forever
       removed from the index, thus reducing the size of the index file.
       Stoppers are also removed from queries in an intelligent manner so most
       users will never notice that some words were completely ignored.

CommonWords
       Use: ferret [...] commonwords <min> [max]

       Both min and max can be either a number between 0 and 1 to indicate a
       frequency (e.g. 0.90 = 90% of all documents) or a whole number greater
       than 1 to indicate an exact number of documents.  Running "commonwords
       0.90" will return a list of words in more than 90% of the documents.
       This is useful for determining words to be added to the "stopper" list
       (see "addstoppers" and "removestoppers").  Running "commonwords 0.00
       0.10" will list all the words that are in less than 10% of all
       documents.
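
       The two ways of reading a threshold can be illustrated with a small
       calculation.  threshold_to_docs is a hypothetical helper, not a ferret
       command, and the total document count is supplied by hand.  How ferret
       treats a value of exactly 1 is not stated here; this sketch treats it
       as a fraction.

```shell
# Hypothetical helper: convert a commonwords threshold into a
# document count, given the total number of indexed documents.
threshold_to_docs() {
    awk -v t="$1" -v n="$2" 'BEGIN {
        if (t > 1) print int(t)   # whole number: exact document count
        else       print t * n    # fraction of all documents
    }'
}

threshold_to_docs 0.90 200   # prints 180
threshold_to_docs 25 200     # prints 25
```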

       In general, using addstoppers to avoid indexing words that appear in
       more than 50% of all documents will not degrade query performance and
       can reduce the index size by a significant amount.

Query
       Use: ferret [...] query [--summaries] '<query-string>'

       Options:
           summaries    display summaries for all matches found

       The query string should be enclosed in single quotes so that double
       quotes can be passed as part of the query.  Any single quotes in the
       query (for apostrophes in contractions) will have to be escaped with a
       backslash.

       For more information on the format of the query, see the ferret(3) man
       page.
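
       The quoting rule above can be tried without running a query.  In a
       Bourne-style shell a backslash has no effect inside single quotes, so
       the usual idiom is to close the quotes, add an escaped apostrophe, and
       reopen them.  The query text itself is only illustrative; the actual
       query syntax is described in ferret(3).

```shell
# Build a query that contains both double quotes and an apostrophe.
# The sequence '\'' ends the quoted string, adds a literal ', and
# starts a new quoted string.
query='"search engine" don'\''t'

# Show exactly what the program would receive as its argument.
printf '%s\n' "$query"   # prints "search engine" don't
```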

RemoveFile
       Use: ferret [...] removefile [--addprefix=<path>]
       [--stripprefix=<regex>] <file-to-remove> [...]

       Options:
           addprefix    add to each filename before indexing
           stripprefix  strip from each filename before indexing

       The RemoveFile action will remove a document from the index.  Note that
       the name provided as file-to-remove must match the indexed name -- not
       the file name.  The indexed name is the file name with any prefixes
       stripped and added.

       The --addprefix and --stripprefix parameters behave identically to
       those of the "addfile" action.  See the AddFile entry elsewhere in
       this document for more information.
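
       The indexed name described above (strip first, then add) can be
       approximated in shell to check what name to pass to removefile.  This
       is a sketch under assumptions: indexed_name is a hypothetical helper,
       it uses sed rather than Perl, it ignores case-insensitivity, and it
       assumes only the first match of the pattern is removed.  It agrees
       with the documented behaviour only for simple patterns that do not
       contain the "|" character.

```shell
# Hypothetical helper: strip a pattern from a file name, then
# prepend a prefix, in the order the options are applied.
indexed_name() {
    file="$1"; strip="$2"; add="$3"
    if [ -n "$strip" ]; then
        # Remove the first match of the pattern (sed basic regex).
        file=$(printf '%s\n' "$file" | sed "s|$strip||")
    fi
    printf '%s%s\n' "$add" "$file"
}

# Equivalent of --stripprefix=".*/" --addprefix="/mnt/"
indexed_name /usr/local/repository/doc/guide.html '.*/' '/mnt/'
# prints /mnt/guide.html
```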

RemoveStoppers
       Use: ferret [...] removestoppers <word> [...]

       If there are "stopper" words that should be indexed, simply remove
       those words from the list using this command.  Note that 1 and 2 letter
       words are always considered stoppers unless turned off using the
       nostoppers option (see the NoStoppers entry elsewhere in this
       document).

       If a stopper is removed from the list, all future documents will be
       indexed on that word.  Documents already indexed, however, will not be
       indexed on that word until they are next updated.

SetOption / UnsetOption
       Use: ferret [...] setoption|unsetoption <option> [...]

       Options:
           tiny         make index as small as possible
           nostoppers   don't remove any stopper words

       This action will set index-specific options.  These options will remain
       in effect until the index file is deleted or until unset with the
       "UnsetOption" action.

       Tiny
           A "tiny" database can use significantly less disk space but does so
           at the cost of all proximity searches, including phrases.

           If this option is set when documents already exist in the index,
           there will be a delay while all of the extraneous information is
           removed.  Be sure to run the shrink action afterwards to
           reclaim disk space back from the index file.

           Important:  This option cannot be unset!  The index file must be
           erased and regenerated from scratch.

       NoStoppers
           If it is important to be able to search for all words, the
           "nostoppers" option will ensure that every word is indexed.  While
           this can dramatically increase the size of an index file, it does
           ensure that, for example, searching Shakespeare for "to be or not
           to be" (a phrase composed entirely of stoppers) will succeed.
           (Hamlet, Act III, Scene I)

           Note that documents are indexed and queries are run based on the
           current state of this option, which could cause some confusion to
           the user if changed without reindexing all of the stored documents.
           If this option is to be used, it should probably be set immediately
           after opening the index for the first time and never unset.

Shrink
       Use: ferret [...] shrink

       This action will reclaim all unused space from the index.  While that
       space would be reused during future addfile actions, it will only be
       returned to the file system with this call.

COPYRIGHT
       Ferret is copyright (c) 1996 by Verisim, Inc.

SEE ALSO
       ferret(3)

       For more information, join the Ferret mailing list by mailing the
       command "subscribe ferret-list" to the address:
       ferret-list-requst@verisim.com

BUGS
       Report bugs to: ferret-bugs@verisim.com.  Please include the version of
       Ferret being run, a detailed description of the problem, and how to
       reproduce it.  Thank you!




3rd Berkeley Distribution    perl 5.003, patch 07               BIN::FERRET(1)