recoll.conf

RECOLL.CONF(5)                File Formats Manual               RECOLL.CONF(5)



NAME
       recoll.conf - main personal configuration file for Recoll

DESCRIPTION
       This file defines the index configuration for the Recoll full-text
       search system.

       The system-wide configuration file is normally located inside
       /usr/[local]/share/recoll/examples. Any parameter set in the common
       file may be overridden by setting it in the personal configuration
       file, by default: $HOME/.recoll/recoll.conf

       Please note while I try to keep this manual page reasonably up to date,
       it will frequently lag the current state of the software. The best
       source of information about the configuration are the comments in the
       system-wide configuration file or the user manual which you can access
       from the recoll GUI help menu or on the recoll web site.


       A short extract of the file might look as follows:

              # Space-separated list of directories to index.
              topdirs =  ~/docs /usr/share/doc

              [~/somedirectory-with-utf8-txt-files]
              defaultcharset = utf-8


       There are three kinds of lines:

              •      Comment or empty

              •      Parameter affectation

              •      Section definition

       Empty lines or lines beginning with # are ignored.

       Affectation lines are in the form 'name = value'.

       Section lines allow redefining a parameter for a directory subtree.
       Some of the parameters used for indexing are looked up hierarchically
       from the more to the less specific. Not all parameters can be
       meaningfully redefined, this is specified for each in the next section.

       The tilde character (~) is expanded in file names to the name of the
       user's home directory.

       Where values are lists, white space is used for separation, and
       elements with embedded spaces can be quoted with double-quotes.

OPTIONS
       topdirs = string
              Space-separated list of files or directories to recursively
              index. Default to ~ (indexes $HOME). You can use symbolic links
              in the list, they will be followed, independently of the value
              of the followLinks variable.

       skippedNames = string
              Files and directories which should be ignored.  White space
              separated list of wildcard patterns (simple ones, not paths,
              must contain no / ), which will be tested against file and
              directory names.  The list in the default configuration does not
              exclude hidden directories (names beginning with a dot), which
              means that it may index quite a few things that you do not want.
              On the other hand, email user agents like Thunderbird usually
              store messages in hidden directories, and you probably want this
              indexed. One possible solution is to have '.*'  in
              'skippedNames', and add things like '~/.thunderbird'
              '~/.evolution' to 'topdirs'.  Not even the file names are
              indexed for patterns in this list, see the 'noContentSuffixes'
              variable for an alternative approach which indexes the file
              names. Can be redefined for any subtree.

       noContentSuffixes = string
              List of name endings (not necessarily dot-separated suffixes)
              for which we don't try MIME type identification, and don't
              uncompress or index content. Only the names will be indexed.
              This complements the now obsoleted recoll_noindex list from the
              mimemap file, which will go away in a future release (the move
              from mimemap to recoll.conf allows editing the list through the
              GUI). This is different from skippedNames because these are name
              ending matches only (not wildcard patterns), and the file name
              itself gets indexed normally. This can be redefined for
              subdirectories.

       skippedPaths = string
              Paths we should not go into. Space-separated list of wildcard
              expressions for filesystem paths. Can contain files and
              directories. The database and configuration directories will
              automatically be added. The expressions are matched using
              'fnmatch(3)' with the FNM_PATHNAME flag set by default. This
              means that '/' characters must be matched explicitly. You can
              set 'skippedPathsFnmPathname' to 0 to disable the use of
              FNM_PATHNAME (meaning that '/*/dir3' will match
              '/dir1/dir2/dir3').  The default value contains the usual mount
              point for removable media to remind you that it is a bad idea to
              have Recoll work on these (esp. with the monitor: media gets
              indexed on mount, all data gets erased on unmount).  Explicitly
              adding '/media/xxx' to the topdirs will override this.

       skippedPathsFnmPathname = bool
              Set to 0 to override use of FNM_PATHNAME for matching skipped
              paths.

       daemSkippedPaths = string
              skippedPaths equivalent specific to real time indexing. This
              enables having parts of the tree which are initially indexed but
              not monitored. If daemSkippedPaths is not set, the daemon uses
              skippedPaths.

       zipSkippedNames = string
              Space-separated list of wildcard expressions for names that
              should be ignored inside zip archives. This is used directly by
              the zip handler, and has a function similar to skippedNames, but
              works independently. Can be redefined for subdirectories.
              Supported by recoll 1.20 and newer. See
              https://bitbucket.org/medoc/recoll/wiki/Filtering%20out%20Zip%20archive%20members


       followLinks = bool
              Follow symbolic links during indexing. The default is to ignore
              symbolic links to avoid multiple indexing of linked files. No
              effort is made to avoid duplication when this option is set to
              true. This option can be set individually for each of the
              'topdirs' members by using sections. It can not be changed below
              the 'topdirs' level. Links in the 'topdirs' list itself are
              always followed.

       indexedmimetypes = string
              Restrictive list of indexed mime types. Normally not set (in
              which case all supported types are indexed). If it is set, only
              the types from the list will have their contents indexed. The
              names will be indexed anyway if indexallfilenames is set
              (default). MIME type names should be taken from the mimemap
              file. Can be redefined for subtrees.

       excludedmimetypes = string
              List of excluded MIME types. Lets you exclude some types from
              indexing. Can be redefined for subtrees.

       compressedfilemaxkbs = int
              Size limit for compressed files. We need to decompress these in
              a temporary directory for identification, which can be wasteful
              in some cases. Limit the waste. Negative means no limit. 0
              results in no processing of any compressed file. Default 50 MB.

       textfilemaxmbs = int
              Size limit for text files. Mostly for skipping monster logs.
              Default 20 MB.

       indexallfilenames = bool
              Index the file names of unprocessed files Index the names of
              files the contents of which we don't index because of an
              excluded or unsupported MIME type.

       usesystemfilecommand = bool
              Use a system command for file MIME type guessing as a final step
              in file type identification This is generally useful, but will
              usually cause the indexing of many bogus 'text' files. See
              'systemfilecommand' for the command used.

       systemfilecommand = string
              Command used to guess MIME types if the internal methods fails
              This should be a "file -i" workalike.  The file path will be
              added as a last parameter to the command line. 'xdg-mime' works
              better than the traditional 'file' command, and is now the
              configured default (with a hard-coded fallback to 'file')

       processwebqueue = bool
              Decide if we process the Web queue. The queue is a directory
              where the Recoll Web browser plugins create the copies of
              visited pages.

       textfilepagekbs = int
              Page size for text files. If this is set, text/plain files will
              be divided into documents of approximately this size. Will
              reduce memory usage at index time and help with loading data in
              the preview window at query time. Particularly useful with very
              big files, such as application or system logs. Also see
              textfilemaxmbs and compressedfilemaxkbs.

       membermaxkbs = int
              Size limit for archive members. This is passed to the filters in
              the environment as RECOLL_FILTER_MAXMEMBERKB.

       indexStripChars = bool
              Decide if we store character case and diacritics in the index.
              If we do, searches sensitive to case and diacritics can be
              performed, but the index will be bigger, and some marginal
              weirdness may sometimes occur. The default is a stripped index.
              When using multiple indexes for a search, this parameter must be
              defined identically for all. Changing the value implies an index
              reset.

       nonumbers = bool
              Decides if terms will be generated for numbers. For example
              "123", "1.5e6", 192.168.1.4, would not be indexed if nonumbers
              is set ("value123" would still be). Numbers are often quite
              interesting to search for, and this should probably not be set
              except for special situations, ie, scientific documents with
              huge amounts of numbers in them, where setting nonumbers will
              reduce the index size. This can only be set for a whole index,
              not for a subtree.

       dehyphenate = bool
              Determines if we index 'coworker' also when the input is 'co-
              worker'.  This is new in version 1.22, and on by default.
              Setting the variable to off allows restoring the previous
              behaviour.

       nocjk = bool
              Decides if specific East Asian (Chinese Korean Japanese)
              characters/word splitting is turned off. This will save a small
              amount of CPU if you have no CJK documents. If your document
              base does include such text but you are not interested in
              searching it, setting nocjk may be a significant time and space
              saver.

       cjkngramlen = int
              This lets you adjust the size of n-grams used for indexing CJK
              text. The default value of 2 is probably appropriate in most
              cases. A value of 3 would allow more precision and efficiency on
              longer words, but the index will be approximately twice as
              large.

       indexstemminglanguages = string
              Languages for which to create stemming expansion data. Stemmer
              names can be found by executing 'recollindex -l', or this can
              also be set from a list in the GUI.

       defaultcharset = string
              Default character set. This is used for files which do not
              contain a character set definition (e.g.: text/plain). Values
              found inside files, e.g. a 'charset' tag in HTML documents, will
              override it. If this is not set, the default character set is
              the one defined by the NLS environment ($LC_ALL, $LC_CTYPE,
              $LANG), or ultimately iso-8859-1 (cp-1252 in fact).  If for some
              reason you want a general default which does not match your LANG
              and is not 8859-1, use this variable. This can be redefined for
              any sub-directory.

       unac_except_trans = string
              A list of characters, encoded in UTF-8, which should be handled
              specially when converting text to unaccented lowercase. For
              example, in Swedish, the letter a with diaeresis has full
              alphabet citizenship and should not be turned into an a.  Each
              element in the space-separated list has the special character as
              first element and the translation following. The handling of
              both the lowercase and upper-case versions of a character should
              be specified, as appartenance to the list will turn-off both
              standard accent and case processing. The value is global and
              affects both indexing and querying.  Examples:

              Swedish:

              unac_except_trans = ää Ãä öö Ãö üü Ãü Ãss Åoe Åoe æae
              Ãae ï¬ff ï¬fi ï¬fl åå ÃÃ¥

              German:

              unac_except_trans = ää Ãä öö Ãö üü Ãü Ãss Åoe Åoe æae
              Ãae ï¬ff ï¬fi ï¬fl

              In French, you probably want to decompose oe and ae and nobody
              would type a German Ã

              unac_except_trans = Ãss Åoe Åoe æae Ãae ï¬ff ï¬fi ï¬fl

              The default for all until someone protests follows. These
              decompositions are not performed by unac, but it is unlikely
              that someone would type the composed forms in a search.

              unac_except_trans = Ãss Åoe Åoe æae Ãae ï¬ff ï¬fi ï¬fl

       maildefcharset = string
              Overrides the default character set for email messages which
              don't specify one. This is mainly useful for readpst (libpst)
              dumps, which are utf-8 but do not say so.

       localfields = string
              Set fields on all files (usually of a specific fs area). Syntax
              is the usual: name = value ; attr1 = val1 ; [...]  value is
              empty so this needs an initial semi-colon. This is useful, e.g.,
              for setting the rclaptg field for application selection inside
              mimeview.

       testmodifusemtime = bool
              Use mtime instead of ctime to test if a file has been modified.
              The time is used in addition to the size, which is always used.
              Setting this can reduce re-indexing on systems where extended
              attributes are used (by some other application), but not
              indexed, because changing extended attributes only affects
              ctime.  Notes: - This may prevent detection of change in some
              marginal file rename cases (the target would need to have the
              same size and mtime).  - You should probably also set
              noxattrfields to 1 in this case, except if you still prefer to
              perform xattr indexing, for example if the local file update
              pattern makes it of value (as in general, there is a risk for
              pure extended attributes updates without file modification to go
              undetected). Perform a full index reset after changing this.


       noxattrfields = bool
              Disable extended attributes conversion to metadata fields. This
              probably needs to be set if testmodifusemtime is set.

       metadatacmds = string
              Define commands to gather external metadata, e.g. tmsu tags.
              There can be several entries, separated by semi-colons, each
              defining which field name the data goes into and the command to
              use. Don't forget the initial semi-colon. All the field names
              must be different. You can use aliases in the "field" file if
              necessary.  As a not too pretty hack conceded to convenience,
              any field name beginning with "rclmulti" will be taken as an
              indication that the command returns multiple field values inside
              a text blob formatted as a recoll configuration file ("fieldname
              = fieldvalue" lines). The rclmultixx name will be ignored, and
              field names and values will be parsed from the data.  Example:
              metadatacmds = ; tags = tmsu tags %f; rclmulti1 = cmdOutputsConf
              %f


       cachedir = dfn
              Top directory for Recoll data. Recoll data directories are
              normally located relative to the configuration directory (e.g.
              ~/.recoll/xapiandb, ~/.recoll/mboxcache). If 'cachedir' is set,
              the directories are stored under the specified value instead
              (e.g. if cachedir is ~/.cache/recoll, the default dbdir would be
              ~/.cache/recoll/xapiandb).  This affects dbdir, webcachedir,
              mboxcachedir, aspellDicDir, which can still be individually
              specified to override cachedir.  Note that if you have multiple
              configurations, each must have a different cachedir, there is no
              automatic computation of a subpath under cachedir.

       maxfsoccuppc = int
              Maximum file system occupation over which we stop indexing. The
              value is a percentage, corresponding to what the "Capacity" df
              output column shows. The default value is 0, meaning no
              checking.

       xapiandb = dfn
              Xapian database directory location. This will be created on
              first indexing. If the value is not an absolute path, it will be
              interpreted as relative to cachedir if set, or the configuration
              directory (-c argument or $RECOLL_CONFDIR).  If nothing is
              specified, the default is then ~/.recoll/xapiandb/

       idxstatusfile = fn
              Name of the scratch file where the indexer process updates its
              status. Default: idxstatus.txt inside the configuration
              directory.

       mboxcachedir = dfn
              Directory location for storing mbox message offsets cache files.
              This is normally 'mboxcache' under cachedir if set, or else
              under the configuration directory, but it may be useful to share
              a directory between different configurations.

       mboxcacheminmbs = int
              Minimum mbox file size over which we cache the offsets. There is
              really no sense in caching offsets for small files. The default
              is 5 MB.

       webcachedir = dfn
              Directory where we store the archived web pages. This is only
              used by the web history indexing code Default: cachedir/webcache
              if cachedir is set, else $RECOLL_CONFDIR/webcache

       webcachemaxmbs = int
              Maximum size in MB of the Web archive. This is only used by the
              web history indexing code.  Default: 40 MB.  Reducing the size
              will not physically truncate the file.

       webqueuedir = fn
              The path to the Web indexing queue. This is hard-coded in the
              plugin as ~/.recollweb/ToIndex so there should be no need or
              possibility to change it.

       aspellDicDir = dfn
              Aspell dictionary storage directory location. The aspell
              dictionary (aspdict.(lang).rws) is normally stored in the
              directory specified by cachedir if set, or under the
              configuration directory.

       filtersdir = dfn
              Directory location for executable input handlers. If
              RECOLL_FILTERSDIR is set in the environment, we use it instead.
              Defaults to $prefix/share/recoll/filters. Can be redefined for
              subdirectories.

       iconsdir = dfn
              Directory location for icons. The only reason to change this
              would be if you want to change the icons displayed in the result
              list. Defaults to $prefix/share/recoll/images

       idxflushmb = int
              Threshold (megabytes of new data) where we flush from memory to
              disk index. Setting this allows some control over memory usage
              by the indexer process. A value of 0 means no explicit flushing,
              which lets Xapian perform its own thing, meaning flushing every
              $XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted:
              as memory usage depends on average document size, not only
              document count, the Xapian approach is is not very useful, and
              you should let Recoll manage the flushes.  The default value of
              idxflushmb is 10 MB, and may be a bit low. If you are looking
              for maximum speed, you may want to experiment with values
              between 20 and 80. In my experience, values beyond 100 are
              always counterproductive. If you find otherwise, please drop me
              a note.

       filtermaxseconds = int
              Maximum external filter execution time in seconds. Default 1200
              (20mn). Set to 0 for no limit. This is mainly to avoid infinite
              loops in postscript files (loop.ps)

       filtermaxmbytes = int
              Maximum virtual memory space for filter processes
              (setrlimit(RLIMIT_AS)), in megabytes. Note that this includes
              any mapped libs (there is no reliable Linux way to limit the
              data space only), so we need to be a bit generous here. Anything
              over 2000 will be ignored on 32 bits machines.

       thrQSizes = string
              Stage input queues configuration. There are three internal
              queues in the indexing pipeline stages (file data extraction,
              terms generation, index update). This parameter defines the
              queue depths for each stage (three integer values). If a value
              of -1 is given for a given stage, no queue is used, and the
              thread will go on performing the next stage. In practise, deep
              queues have not been shown to increase performance. Default: a
              value of 0 for the first queue tells Recoll to perform
              autoconfiguration based on the detected number of CPUs (no need
              for the two other values in this case).  Use thrQSizes = -1 -1
              -1 to disable multithreading entirely.

       thrTCounts = string
              Number of threads used for each indexing stage. The three stages
              are: file data extraction, terms generation, index update). The
              use of the counts is also controlled by some special values in
              thrQSizes: if the first queue depth is 0, all counts are ignored
              (autoconfigured); if a value of -1 is used for a queue depth,
              the corresponding thread count is ignored. It makes no sense to
              use a value other than 1 for the last stage because updating the
              Xapian index is necessarily single-threaded (and protected by a
              mutex).

       loglevel = int
              Log file verbosity 1-6. A value of 2 will print only errors and
              warnings. 3 will print information like document updates, 4 is
              quite verbose and 6 very verbose.

       logfilename = fn
              Log file destination. Use 'stderr' (default) to write to the
              console.

       idxloglevel = int
              Override loglevel for the indexer.

       idxlogfilename = fn
              Override logfilename for the indexer.

       daemloglevel = int
              Override loglevel for the indexer in real time mode. The default
              is to use the idx... values if set, else the log... values.

       daemlogfilename = fn
              Override logfilename for the indexer in real time mode. The
              default is to use the idx... values if set, else the log...
              values.

       idxrundir = dfn
              Indexing process current directory. The input handlers sometimes
              leave temporary files in the current directory, so it makes
              sense to have recollindex chdir to some temporary directory. If
              the value is empty, the current directory is not changed. If the
              value is (literal) tmp, we use the temporary directory as set by
              the environment (RECOLL_TMPDIR else TMPDIR else /tmp). If the
              value is an absolute path to a directory, we go there.

       checkneedretryindexscript = fn
              Script used to heuristically check if we need to retry indexing
              files which previously failed.  The default script checks the
              modified dates on /usr/bin and /usr/local/bin. A relative path
              will be looked up in the filters dirs, then in the path. Use an
              absolute path to do otherwise.

       recollhelperpath = string
              Additional places to search for helper executables. This is only
              used on Windows for now.

       idxabsmlen = int
              Length of abstracts we store while indexing. Recoll stores an
              abstract for each indexed file.  The text can come from an
              actual 'abstract' section in the document or will just be the
              beginning of the document. It is stored in the index so that it
              can be displayed inside the result lists without decoding the
              original file. The idxabsmlen parameter defines the size of the
              stored abstract. The default value is 250 bytes. The search
              interface gives you the choice to display this stored text or a
              synthetic abstract built by extracting text around the search
              terms. If you always prefer the synthetic abstract, you can
              reduce this value and save a little space.

       idxmetastoredlen = int
              Truncation length of stored metadata fields. This does not
              affect indexing (the whole field is processed anyway), just the
              amount of data stored in the index for the purpose of displaying
              fields inside result lists or previews. The default value is 150
              bytes which may be too low if you have custom fields.

       aspellLanguage = string
              Language definitions to use when creating the aspell dictionary.
              The value must match a set of aspell language definition files.
              You can type "aspell dicts"  to see a list The default if this
              is not set is to use the NLS environment to guess the value.

       aspellAddCreateParam = string
              Additional option and parameter to aspell dictionary creation
              command. Some aspell packages may need an additional option
              (e.g. on Debian Jessie: --local-data-dir=/usr/lib/aspell). See
              Debian bug 772415.

       aspellKeepStderr = bool
              Set this to have a look at aspell dictionary creation errors.
              There are always many, so this is mostly for debugging.

       noaspell = bool
              Disable aspell use. The aspell dictionary generation takes time,
              and some combinations of aspell version, language, and local
              terms, result in aspell crashing, so it sometimes makes sense to
              just disable the thing.

       monauxinterval = int
              Auxiliary database update interval. The real time indexer only
              updates the auxiliary databases (stemdb, aspell) periodically,
              because it would be too costly to do it for every document
              change. The default period is one hour.

       monixinterval = int
              Minimum interval (seconds) between processings of the indexing
              queue. The real time indexer does not process each event when it
              comes in, but lets the queue accumulate, to diminish overhead
              and to aggregate multiple events affecting the same file.
              Default 30 S.

       mondelaypatterns = string
              Timing parameters for the real time indexing. Definitions for
              files which get a longer delay before reindexing is allowed.
              This is for fast-changing files, that should only be reindexed
              once in a while. A list of wildcardPattern:seconds pairs. The
              patterns are matched with fnmatch(pattern, path, 0) You can
              quote entries containing white space with double quotes (quote
              the whole entry, not the pattern). The default is empty.
              Example: mondelaypatterns = *.log:20 "*with spaces.*:30"

       monioniceclass = int
              ionice class for the real time indexing process On platforms
              where this is supported. The default value is 3.

       monioniceclassdata = string
              ionice class parameter for the real time indexing process. On
              platforms where this is supported. The default is empty.

       autodiacsens = bool
              auto-trigger diacritics sensitivity (raw index only). IF the
              index is not stripped, decide if we automatically trigger
              diacritics sensitivity if the search term has accented
              characters (not in unac_except_trans). Else you need to use the
              query language and the "D" modifier to specify diacritics
              sensitivity. Default is no.

       autocasesens = bool
              auto-trigger case sensitivity (raw index only). IF the index is
              not stripped (see indexStripChars), decide if we automatically
              trigger character case sensitivity if the search term has upper-
              case characters in any but the first position. Else you need to
              use the query language and the "C" modifier to specify
              character-case sensitivity. Default is yes.

       maxTermExpand = int
              Maximum query expansion count for a single term (e.g.: when
              using wildcards). This only affects queries, not indexing. We
              used to not limit this at all (except for filenames where the
              limit was too low at 1000), but it is unreasonable with a big
              index. Default 10000.

       maxXapianClauses = int
              Maximum number of clauses we add to a single Xapian query. This
              only affects queries, not indexing. In some cases, the result of
              term expansion can be multiplicative, and we want to avoid
              eating all the memory. Default 50000.

       snippetMaxPosWalk = int
              Maximum number of positions we walk while populating a snippet
              for the result list. The default of 1,000,000 may be
              insufficient for very big documents, the consequence would be
              snippets with possibly meaning-altering missing words.

       pdfocr = bool
              Attempt OCR of PDF files with no text content if both tesseract
              and pdftoppm are installed. The default is off because OCR is so
              very slow.

       pdfattach = bool
              Enable PDF attachment extraction by executing pdftk (if
              available). This is normally disabled, because it does slow down
              PDF indexing a bit even if not one attachment is ever found.

       mhmboxquirks = string
              Enable thunderbird/mozilla-seamonkey mbox format quirks Set this
              for the directory where the email mbox files are stored.


SEE ALSO
       recollindex(1) recoll(1)



                               14 November 2012                 RECOLL.CONF(5)