sitescooper

NAME
sitescooper − download news from web sites and convert it
automatically into one of several formats suitable for
viewing on a Palm handheld.

SYNOPSIS
     sitescooper [options] [[−site sitename] ...]

     sitescooper [options] [−sites sitename ...]

     sitescooper [options] [−name nm] [−levels n] [−storyurl regexp]
     [−set sitefileparam value] url [...]

     Options: [−debug] [−refresh] [−fullrefresh] [−config file]
     [−install dir] [−instapp app] [−dump] [−dumpprc] [−nowrite]
     [−nodates] [−quiet] [−admin cmd] [−nolinkrewrite]
     [−stdout‐to file] [−badcache] [−keep‐tmps] [−fromcache]
     [−noheaders] [−nofooters] [−outputtemplate file.tmpl]
     [−grep] [−profile file.nhp] [−profiles file.nhp file2.nhp ...]
     [−filename template] [−prctitle template] [−parallel]
     [−disc] [−limit numkbytes] [−maxlinks numlinks]
     [−maxstories numstories]

     [−text | −html | −mhtml | −doc | −plucker | −mplucker |
     −isilo | −misilo | −richreader | −pipe fmt command]
     [−bw | −color] [−cvtargs args_for_converter]

DESCRIPTION
This script, in conjunction with its configuration file and
its set of site files, will download news stories from
several top news sites into text format and/or onto your
Palm handheld (with the aid of the makedoc/MakeDocW or iSilo
utilities).

     Alternatively, URLs can be supplied on the command line,
in which case those URLs will be downloaded and converted
using a reasonable set of default settings.

     HTTP and local files, using the file:/// protocol, are
both supported.

     Multiple types of sites are supported: 1‐level sites,
where the text to be converted is all present on one page
(such as Slashdot, Linux Weekly News, BluesNews, NTKnow, Ars
Technica);

2‐level sites, where the text to be converted is linked to
from a Table of Contents page (such as Wired News, BBC News,
and I, Cringely);

3‐level sites, where the text to be converted is linked to
from a Table of Contents page, which in turn is linked to
from a list of issues page (such as PalmPower).

     In addition, sites that post news as items on one big
page, such as Slashdot, Ars Technica, and BluesNews, are
supported: sitescooper diffs the page against the previously
seen version and scoops only the newly added items.
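     This diff‐based detection can be sketched as follows (a
minimal illustration in Python using difflib; sitescooper
itself is written in Perl, so this is not its actual code):

```python
import difflib

def new_items(old_page, new_page):
    """Return lines present in the newly fetched page but not the cached one."""
    diff = difflib.ndiff(old_page.splitlines(), new_page.splitlines())
    # ndiff prefixes added lines with '+ '; those are the unseen stories.
    return [line[2:] for line in diff if line.startswith('+ ')]

old = "Story A\nStory B"
new = "Story C\nStory A\nStory B\nStory D"
print(new_items(old, new))  # ['Story C', 'Story D']
```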

     Note that the URLs‐on‐the‐command‐line invocation
format does not currently support 2‐ or 3‐level sites.

     The script is portable to most UNIX variants that
support perl, as well as the Win32 platform (tested with
ActivePerl 5.00502 build 509).

     sitescooper maintains a cache in its temporary
directory; files are kept in this cache for a week at most.
Ditto for the text output directory (set with TextSaveDir in
the built‐in configuration).

     If a password is required for the site, and the current
sitescooper session is interactive, the user will be
prompted for the username and password.  This authentication
token will be saved for later use.  This way a site that
requires login can be set up as a .site ‐‐ just log in once,
and your password is saved for future non‐interactive runs.

     Note however that the encryption used to hide the
password in the sitescooper configuration is pretty
transparent; rather than using your own username and
password to log in to passworded sites, I recommend using a
dedicated sitescooper account instead.



OPTIONS
‐refresh
    Refresh all links ‐‐ ignore the already_seen file, do
    not diff pages, and always fetch links.  If a cached
    copy of a page is available, however, it will still be
    used.

‐fullrefresh
    Refresh all links ‐‐ ignore the already_seen file, do
    not diff pages, and always fetch links, even if they are
    available in the cache.

‐config file
    Read the configuration from file instead of using the
    built‐in one.

‐limit numkbytes
    Set the limit on output file size to numkbytes
    kilobytes, instead of the default of 200K. A limit of 0
    means unlimited output.

‐maxlinks numlinks
    Stop retrieving web pages after numlinks have been
    traversed. This is not used to specify how "deep" a site
    should be scooped ‐‐ it is the number of links followed
    in total.

‐maxstories numstories
    Stop retrieving web pages after numstories stories have
    been retrieved.

‐install dir
    The directory to save PDB files to once they’ve been
    converted, in order to have them installed to your Palm
    handheld.

‐instapp app
    The application to run to install PDB files onto your
    Palm, once they’ve been converted.

‐site sitename
    Limit the run to the site named in the sitename
    argument.  Normally all available sites will be
    downloaded. To limit the run to 2 or more sites, provide
    multiple −site arguments like so:

            ‐site ntk.site ‐site tbtf.site


‐sites sitename [...]
    Limit the run to multiple sites; an easier way to
    specify multiple sites than using the −site argument for
    each file.

‐grep
    Use James Brown’s NewsHound profile searching code.  Any
    sites that do not contain IgnoreProfiles: 1 will then be
    searched for the active profiles.  Active profiles are
    loaded from the ProfileDir specified in the sitescooper
    configuration file, or specified using the −profile or
    −profiles arguments.

‐profile file.nhp
    Limit the run to the NewsHound profile named in the
    file.nhp argument.  Normally all active profiles in the
    configured ProfileDir will be loaded. To use 2 or more
    profiles, provide multiple −profile arguments like so:

            ‐profile foo.nhp ‐profile bar.nhp


‐profiles file.nhp [...]
    Limit the run to multiple profiles; an easier way to
    specify multiple profiles than using the −profile
    argument for each file.

‐name name
    When specifying a URL on the command‐line, this provides
    the name that should be used when installing the site to
    the Pilot. It acts exactly the same way as the Name:
    field in a site file.

‐levels n
    When specifying a URL on the command‐line, this
    indicates how many levels a site has. Not needed when
    using .site files.

‐storyurl regexp
    When specifying a URL on the command‐line, this
    indicates the regular expression which links to stories
    should conform to. Not needed when using .site files.

‐doc
    Convert the page(s) downloaded into DOC format, with all
    the articles listed in full, one after the other.

‐text
    Convert the page(s) downloaded into plain text format,
    with all the articles listed in full, one after the
    other.

‐html
    Convert the page(s) downloaded into HTML format, on one
    big page, with a table of contents (taken from the site
    if possible), followed by all the articles one after
    another.

‐mhtml
    Convert the page(s) downloaded into HTML format, but
    retain the multiple‐page format. This will create the
    output in a directory called site_name; in conjunction
    with the −dump argument, it will output the path of this
    directory on standard output before exiting.

‐plucker
    Convert the page(s) downloaded into Plucker format (see
    http://plucker.gnu‐designs.com/ ), on one big page.  The
    page(s) will be displayed with a table of contents
    (taken from the site if possible), followed by all the
    articles one after another.

‐isilo
    Convert the page(s) downloaded into iSilo format (see
    http://www.isilo.com/ ), on one big page.  This is the
    default.  The page(s) will be displayed with a table of
    contents (taken from the site if possible), followed by
    all the articles one after another.

‐misilo
    Convert the page(s) downloaded into iSilo format (see
    http://www.isilo.com/ ), producing one iSilo document
    per site.  Each document will have a table‐of‐contents
    page, taken from the site if possible, with each article
    on a separate page.


‐richreader
    Convert the page(s) downloaded into RichReader format
    using HTML2Doc.exe (see
    http://users.erols.com/arenakm/palm/RichReader.html ).
    The page(s) will be displayed with a table of contents
    (taken from the site if possible), followed by all the
    articles one after another.

‐pipe fmt command
    Convert the page(s) downloaded into an arbitrary format,
    using the command provided. Sitescooper will still
    rewrite the page(s) according to the fmt argument, which
    should be one of:

text    Plain text format.

html    HTML in one big page.

mhtml   HTML in multiple pages.

        The command argument can contain __SCOOPFILE__,
        which will be replaced with the filename of the file
        containing the rewritten pages in the above format,
        __SYNCFILE__, which will be replaced with a suitable
        filename in the Palm synchronization folder, and
        __TITLE__, which will be replaced by the title of
        the file (generally a string containing the date and
        site name).

        Note that for the −mhtml switch, __SCOOPFILE__ will
        be replaced with the name of the file containing the
        table‐of‐contents page. It’s up to the conversion
        utility to follow the href links to the other files
        in that directory.
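        The placeholder substitution can be illustrated with
        a short sketch (hypothetical Python, not
        sitescooper's own Perl; the converter name myconv
        and the paths are invented for the example):

```python
def build_pipe_command(template, scoopfile, syncfile, title):
    """Expand the -pipe command placeholders before the converter is run."""
    return (template
            .replace('__SCOOPFILE__', scoopfile)
            .replace('__SYNCFILE__', syncfile)
            .replace('__TITLE__', title))

cmd = build_pipe_command('myconv __SCOOPFILE__ -o __SYNCFILE__ -t "__TITLE__"',
                         '/tmp/scoop/site.html',
                         '/palm/install/site.pdb',
                         '1999-Jan-01: NTKnow')
print(cmd)  # myconv /tmp/scoop/site.html -o /palm/install/site.pdb -t "1999-Jan-01: NTKnow"
```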

‐cvtargs args_for_converter
    Arguments to pass to the conversion utility. For
    example, Plucker will display images better on some
    Palms using "−cvtargs −−bpp=2" or "−cvtargs −−bpp=4".

‐bw Indicate that the target can display only black‐and‐white
    (2‐bit) images.  This is generally the default for iSilo
    and Plucker.

‐color
    Indicate that the target can display colour images.

‐fixlinks
    Rewrite links to external sites or unscooped pages as
    underlined text, to differentiate them from links to
    scooped pages.  This is the default behaviour for most
    formats apart from −plucker or −mplucker.

‐keeplinks
    Do not rewrite links to external sites or unscooped
    pages; leave them pointing outside the current scoop.
    However, links to other pages that are included in the
    current scoop are rewritten to point to the scooped
    pages instead of the source URL. This is the default for
    Plucker (−plucker or −mplucker arguments).

‐nolinkrewrite
    Do not rewrite links on scooped documents ‐‐ leave them
    exactly as they are, even links to other scooped pages.
    See also −keeplinks.

‐dump
    Output the page(s) downloaded directly to stdout in text
    or HTML format, instead of writing them to files and
    converting each one. This option NO LONGER implies
    −text, as it used to; to dump text, use −dump −text.

‐dumpprc
    Output the page(s) downloaded directly to stdout, in
    converted format as a PDB file (note: not PRC format!),
    suitable for installation to a Palm handheld.

‐nowrite
    Test mode ‐‐ do not write to the cache or already_seen
    file, instead write what would be written normally to a
    directory called new_cache and a new_already_seen file.
    This is very handy when writing a new site file.

‐badcache
    Send some HTTP headers to bypass web caching proxy
    servers.  This is generally useful if a web caching
    proxy server somewhere between sitescooper and the
    target site is returning out‐of‐date files.

‐debug
    Enable debugging output. This output is in addition to
    the usual progress messages.

‐quiet
    Process sites quietly, without printing the usual
    progress messages to STDERR. Warnings about incorrect
    site files and system errors will still be output,
    however.

‐admin cmd
    Perform an administrative command. This is intended to
    ease the task of writing scripts which use sitescooper
    output.  The following admin commands are available:

dump‐sites
        List the sites which would be scooped on a scooping
        run, and their URLs.  Instead of scooping any sites,
        sitescooper will exit after performing this task.
        The format is one site per line, with the site file
        name first, a tab, the site’s URL, a tab, the site
        name, a tab, and the output filename that would be
        generated without path or extension. For example:

        foobar.site    http://www.foobar.com/   Foo Bar   1999_01_01_Foo_Bar
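        Since the format is one record per line with tab‐
        separated fields, it is easy to consume from a
        script; a minimal sketch in Python (the dictionary
        key names are chosen for the example):

```python
def parse_dump_sites_line(line):
    """Split one line of '-admin dump-sites' output into its four fields."""
    sitefile, url, name, outfile = line.rstrip('\n').split('\t')
    return {'sitefile': sitefile, 'url': url, 'name': name, 'outfile': outfile}

rec = parse_dump_sites_line(
    'foobar.site\thttp://www.foobar.com/\tFoo Bar\t1999_01_01_Foo_Bar')
print(rec['outfile'])  # 1999_01_01_Foo_Bar
```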

journal Write a journal with dumps of the documents as they
        pass through the formatting and stripping steps of
        the scooping process. This is written to a file
        called journal in the sitescooper temporary
        directory.

import‐cookies file
        Import a Netscape cookies file into sitescooper, so
        that sites which require cookies can use them.  For
        example, the site economist_full.site requires this.
        Here’s how to import cookies on a UNIX machine:

        sitescooper.pl −admin import‐cookies ~/.netscape/cookies

        and on Windows:

        perl sitescooper.pl −admin import‐cookies
          "C:\Program Files\Netscape\Users\Default\cookies.txt"

        Unfortunately, MS Internet Explorer cookies are
        currently unsupported.  If you wish to write a patch
        to support them, that’d be great.

‐noheaders
    Do not attach the sitescooper header (URL, site name,
    and navigation links) to each page.

‐nofooters
    Do not attach the sitescooper footer ("copyright
    retained by original authors" blurb) to each page.

‐outputtemplate file.tmpl
    Read the output formatting template from the file
    file.tmpl.  This overrides the settings of the
    −noheaders and −nofooters flags.  See the OUTPUT
    TEMPLATES section below for details on this.

‐fromcache
    Do not perform any network access; retrieve everything
    from the cache or the shared cache.

‐filename template
    Change the format of output filenames. template contains
    the following keyword strings, which are substituted as
    follows:

YYYY    The current year, in 4‐digit format.

MM      The current month number (from 01 to 12), in 2‐digit
        format.

Mon     The current month name (from Jan to Dec), in
        3‐letter format.

DD      The current day of the month (from 01 to 31), in
        2‐digit format.

Day     The current day of the week (from Sun to Sat), in
        3‐letter format.

hh      The current hour (from 00 to 23), in 2‐digit format.

mm      The current minute (from 00 to 59), in 2‐digit
        format.

Site    The current site’s name.

Section The section of the current site (now obsolete).

        The default filename template is YYYY_MM_DD_Site.

‐prctitle template
    Change the format of the titles of the resulting PDB
    files. template may contain the same keyword strings as
    −filename.

    The default PDB title template is YYYY−Mon‐DD: Site.
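    The substitution itself is straightforward; here is a
    sketch of how the keywords might be expanded
    (illustrative Python, not sitescooper's actual Perl code
    ‐‐ note that Mon must be handled before Day, or a
    weekday value such as "Mon" would be re‐substituted as a
    month name):

```python
from datetime import datetime

def expand_template(template, site, now):
    """Substitute the -filename / -prctitle keyword strings."""
    subs = [
        ('YYYY', now.strftime('%Y')),
        ('Mon',  now.strftime('%b')),  # before 'Day': Day may yield 'Mon'
        ('MM',   now.strftime('%m')),
        ('DD',   now.strftime('%d')),
        ('Day',  now.strftime('%a')),
        ('hh',   now.strftime('%H')),
        ('mm',   now.strftime('%M')),
        ('Site', site),
    ]
    for key, value in subs:
        template = template.replace(key, value)
    return template

when = datetime(1999, 1, 1, 9, 30)
print(expand_template('YYYY_MM_DD_Site', 'Foo_Bar', when))    # 1999_01_01_Foo_Bar
print(expand_template('YYYY-Mon-DD: Site', 'Foo Bar', when))  # 1999-Jan-01: Foo Bar
```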

‐nodates
    Do not put the date in the installable file’s filename.
    This allows you to automatically overwrite old files
    with new ones when you HotSync. It’s a compatibility
    shortcut for −filename Site −prctitle "Site".

‐preload preload_method
    Preload pages using the given preload method. Currently
    supported preload methods are:

lwp     Use the Perl LWP module to load pages. This is the
        default, and is single‐threaded; in other words,
        each page needs to load fully before the next page
        can be requested.

fork[n] Use a number of subprocesses running LWP requests to
        load pages.  This is multi‐threaded, and several
        pages can be loaded at once; however you pay in
        costs of network bandwidth, CPU time and memory
        used. The optional n argument instructs sitescooper
        to use that number of processes; the default n is 4.
        This is only available on UNIX at the moment.

‐disc
    Disconnect a PPP connection once the scooping has
    finished.  Currently this code is experimental, and will
    probably only work on Macintoshes.  This is off by
    default.

‐stdout‐to file
    Redirect the output of sitescooper into the named file.
    This is needed on Windows NT and 95, where certain
    combinations of perl and Windows do not seem to support
    the > operator.

‐keep‐tmps
    Keep temporary files after conversion. Normally the .txt
    or .html rendition of a site is deleted after
    conversion; this option keeps it around.

OUTPUT TEMPLATES
You can control exactly what HTML or text is written to the
output file using the −outputtemplate argument.  This
argument takes the name of a file, which is read and parsed
to provide replacement templates for sitescooper.

     The file is read as an HTML− or XML−style tagged
format; for example, the template for the main page in HTML
format is read from between the <htmlmainpage> and
</htmlmainpage> tags. The templates that can be defined are
as follows:

htmlmainpage
    The main page, in HTML format; this is used when the
    −html output format, or one based on it (such as
    −plucker or −isilo), is used.  It is also used for the
    −mhtml format’s main (top‐level) page.

htmlsubpage
    Sub‐page, in HTML format; this is used for the −mhtml
    output format’s sub‐pages, i.e. pages other than the
    top‐level one.

htmlstory
    The snippet of HTML encapsulating each story. This is
    included for each piece of snarfed text, in all HTML
    files.

textmainpage
    The main page, in text format; this is used when the
    −text output format, or one based on it (such as −doc),
    is used.


textsubpage
    Sub‐page, in text format; this is currently unused.

textstory
    The snippet of text encapsulating each story. This is
    included for each piece of snarfed text, in all text‐
    format or DOC−format files.

     A sample template file is provided in the file
default_templates.html; this may have been installed in the
sitescooper install directory, /usr/share/sitescooper, or
/usr/local/share/sitescooper.  Note that the actual
templates used are not loaded from this file; instead they
are incorporated inside the sitescooper script, so changing
this file will have no effect.
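     For illustration, pulling one template out of such a
tagged file takes only a few lines (a hypothetical Python
sketch; the __STORIES__ placeholder is invented for the
example and is not sitescooper's actual template syntax):

```python
import re

def load_template(text, name):
    """Extract the template between <name> and </name> tags, or None."""
    m = re.search(r'<%s>(.*?)</%s>' % (name, name), text, re.DOTALL)
    return m.group(1).strip() if m else None

sample = '''
<htmlmainpage>
<html><body>__STORIES__</body></html>
</htmlmainpage>
'''
print(load_template(sample, 'htmlmainpage'))  # <html><body>__STORIES__</body></html>
```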

INSTALLATION
To install, edit the script and change the #! line. You may
also need to (a) change the Pilot install dir if you plan to
use the pilot installation functionality, and (b) edit the
other parameters marked with CUSTOMISE in case they need to
be customised for your site. They should be set to
acceptable defaults (unless I forgot to comment out the
proxy server lines I use ;).



EXAMPLES

             sitescooper.pl http://www.ntk.net/

To snarf the ever‐cutting NTKnow newsletter.

             sitescooper.pl ‐refresh ‐html http://www.ntk.net/

To snarf NTKnow, ignoring any previously‐read text, and
producing HTML output.

             sitescooper.pl ‐refresh ‐html ‐site site_samples/tech/ntk.site

To snarf NTKnow using the site file provided with the main
distribution, producing HTML output.

ENVIRONMENT
sitescooper makes use of the $http_proxy environment
variable, if it is set.

AUTHOR
Justin Mason <jm /at/ jmason.org>

COPYRIGHT
Copyright (C) 1999‐2000 Justin Mason

     This program is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation; either
version 2 of the License, or (at your option) any later
version.

     This program is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.  See the GNU General Public License for more
details.

     You should have received a copy of the GNU General
Public License along with this program; if not, write to the
Free Software Foundation, Inc., 59 Temple Place − Suite 330,
Boston, MA  02111‐1307, USA, or read it on the web at
http://www.gnu.org/copyleft/gpl.html .

SCRIPT CATEGORIES
The CPAN script category for this script is Web. See
http://www.cpan.org/scripts/ .

PREREQUISITES
File::Find, File::Copy, File::Path, FindBin, Carp, Cwd,
URI::URL, LWP::UserAgent, HTTP::Request::Common, HTTP::Date,
HTML::Entities

     All these can be picked up from CPAN at
http://www.cpan.org/ .  Note that HTML::Entities is actually
included in one of the previous packages, so you do not need
to install it separately.

Win32::TieRegistry will be used, if running on a Win32
platform, to find the Pilot Desktop software’s installation
directory. Algorithm::Diff is used to support diffing sites
without running an external diff application (this is
required on Mac systems).

Sitescooper downloads news stories from the web and converts
them to Palm handheld iSilo, DOC or text format for later
reading on‐the‐move.  Site files and full documentation can
be found at http://sitescooper.org/ .
