uri(3)                     Library Functions Manual                     uri(3)



NAME
       uri - a set of functions to manipulate URIs


DESCRIPTION
       The header file for the library is #include <uri.h> and the library may
       be linked using -luri.

       uri is a library that analyses and transforms URIs. It is designed
       to be fast and to use as little memory as possible. The basic usage
       of this library is to transform a URI into a structure with one
       field for each component of the URI, and vice versa.


LIBRARY MODE
       The library behaviour is controlled by the flags described below. The
       default set of flags is URI_MODE_CANNONICAL|URI_MODE_ERROR_STDERR.


       URI_MODE_CANNONICAL
              All objects store the URI in canonical form.


       URI_MODE_LOWER_SCHEME
              The scheme of the URI is always converted to lower case.


       URI_MODE_ERROR_STDERR
              If an error occurs, the error string is printed on the stderr
              channel.


       URI_MODE_FIELD_MALLOC
              Each field may have its own malloc'd space. When the caller
              sets a field, it can assume that the content of the field is
              saved in the object. If this flag is not set, the caller must
              make sure that the memory containing the value of the field
              is not freed before the object is deallocated.


       URI_MODE_FURI_MD5
              Use the MD5 key calculated from the URL as a path name instead
              of the readable path name described in the FURI section below.
              For example, http://www.foo.com/ is transformed into the MD5
              key 33024cec6160eafbd2717e394b5bc201 and the corresponding FURI
              is 33/02/4c/ec6160eafbd2717e394b5bc201.


       URI_MODE_URI_STRICT
              Behave in strict mode (see STRICTNESS below).


       URI_MODE_URI_STRICT_SCHEME
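              Interpret a relative URI containing a scheme as an absolute
              URI only if its scheme differs from the scheme of the base
              URI (see STRICTNESS below).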


       URI_MODE_FLAG_DEFAULT
              The default mode of the library.
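
       The directory layout used by URI_MODE_FURI_MD5 can be illustrated
       with a small stand-alone sketch. It only reproduces the splitting
       shown in the example above (two characters, two characters, two
       characters, then the rest). The function name demo_md5_to_furi is
       hypothetical and is not part of the library, and the MD5 digest
       itself is assumed to have been computed elsewhere.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the uri library: split a 32 character
 * hexadecimal MD5 digest into the aa/bb/cc/rest layout.
 * Returns 0 on success, -1 if the digest or the output buffer is unsuitable.
 */
int demo_md5_to_furi(const char* md5_hex, char* out, size_t out_size)
{
    assert(md5_hex != NULL && out != NULL);

    /* 32 digits + 3 slashes + terminating NUL = 36 bytes. */
    if (strlen(md5_hex) != 32 || out_size < 36)
        return -1;

    snprintf(out, out_size, "%.2s/%.2s/%.2s/%s",
             md5_hex, md5_hex + 2, md5_hex + 4, md5_hex + 6);
    return 0;
}
```

       Called with the key 33024cec6160eafbd2717e394b5bc201, the sketch
       fills the buffer with 33/02/4c/ec6160eafbd2717e394b5bc201.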


STRUCTURE AND ALLOCATION
       The uri_t type is a structure describing the URI.  Access functions are
       provided and should be used to get the values of the fields and set new
       values.  All the fields are character strings whose size is exactly the
       size of the string they contain. One can safely overwrite the values
       contained in the fields, as long as the replacement string is no
       longer than the original. If the replacement string is longer, the
       caller must use a buffer of its own.

       If the flag URI_MODE_FIELD_MALLOC is not set, which is the default, the
       allocation policy for a uri_t object is minimal. When an object is
       allocated using uri_alloc, memory is allocated by the library to store
       the object. This memory is released when the object is freed using
       uri_free.  When a field is set, the pointer is stored in the object and
       no copy of the string is kept. It is the responsibility of the caller
       to make sure that the string lives as long as the object does.
       This policy is designed to avoid allocation as much as possible.
       For a program that operates on 50 000 URLs, only one malloc and a few
       realloc calls are necessary, instead of 50 000 malloc/free pairs
       multiplied by the number of fields of the structure.  The loop looks
       like this:
            /*
             * Allocate an empty object.
             */
            uri_t* uri = uri_alloc_1();

            for(i = 0; i < 50000; i++) {
               /*
                * Reuse the object for another URL. The object grows
                * only if needed, because the URL is larger than
                * any previously seen URL.
                */
               if(uri_realloc(uri, url[i], strlen(url[i])) != URI_CANNONICAL)
                  continue;   /* url[i] cannot be put in canonical form */
               /* ... do something with uri ... */
               /*
                * Print the URL on stdout.
                */
               printf("%s\n", uri_uri(uri));
            }
            /*
             * Release the object once all URLs are processed.
             */
            uri_free(uri);

       If the flag URI_MODE_FIELD_MALLOC is set, each field has separately
       allocated space, if necessary. The caller may assume that the object
       is always self-contained and does not depend on externally allocated
       strings. Each set function (uri_scheme_set, uri_host_set etc.)
       allocates the necessary space and duplicates the string given as
       argument. The info field contains flags that record which fields
       point to malloc'd space and which do not (URI_INFO_M_* flags). This
       information is only valid between two calls of the library functions.
       For instance, uri_cannonicalize will reorganize the allocated space.
       This policy is used for integration of the library into scripting
       languages such as Perl.
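
       The difference between the two allocation policies can be sketched
       with a miniature object that has a single field. Everything in this
       sketch (demo_uri_t, demo_host_set, DEMO_INFO_M_HOST and the
       field_malloc argument) is hypothetical and only mimics the behaviour
       described above; it is not the library's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_INFO_M_HOST 0x01   /* hypothetical: host points to malloc'd space */

/* Hypothetical miniature of a URI object with a single field. */
typedef struct {
    char* host;
    int   info;
} demo_uri_t;

/*
 * Set the host field. When field_malloc is non-zero the value is
 * duplicated and the object owns the copy; otherwise only the caller's
 * pointer is stored and the caller must keep the string alive.
 * Returns 0 on success, -1 if the copy could not be allocated.
 */
int demo_host_set(demo_uri_t* object, char* value, int field_malloc)
{
    assert(object != NULL && value != NULL);

    /* Storing the pointer the object already holds changes nothing. */
    if (!field_malloc && value == object->host)
        return 0;

    char* stored = value;
    if (field_malloc) {
        size_t size = strlen(value) + 1;
        stored = malloc(size);
        if (stored == NULL)
            return -1;
        memcpy(stored, value, size);
    }

    /* Release a previous copy only if this object owns it. */
    if (object->info & DEMO_INFO_M_HOST)
        free(object->host);

    object->host = stored;
    if (field_malloc)
        object->info |= DEMO_INFO_M_HOST;
    else
        object->info &= ~DEMO_INFO_M_HOST;
    return 0;
}
```

       With field_malloc set to 0, the object merely points at the caller's
       buffer, so no allocation happens per field. With field_malloc set to
       1, the object is self-contained at the cost of one allocation per
       field that is set.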


       info   A bit field carrying information about the URI. Each bit has a
              corresponding define with the following meaning.

              URI_INFO_CANNONICAL
                     Set if the URI is in canonical form.

              URI_INFO_RELATIVE
                     Set if the URI is a relative URI (does not start with
                     {http,..}://).

              URI_INFO_RELATIVE_PATH
                     Set if the URI is a relative URI and the path does not
                     start with a /.

              URI_INFO_PARSED
                     Set if the URI was successfully parsed. If this flag is
                     not set, the content of the object is undefined.

              URI_INFO_ROBOTS
                     Set if the URI is an http robots.txt file.

              URI_INFO_M_*
                     There is such a flag for each field of the uri_t
                     structure. If the flag is set, the memory pointed to by
                     this field has been allocated by malloc.


       scheme The scheme of the URI (http, ftp, file or news).


       host   The host name part of the URI.


       port   The port number associated with host, if any.


       path   The path name of the URI.


       params The parameters of the URI (i.e. what is found after the ; in the
              path).


       query  The query part of a cgi-bin call (i.e. what is found after the ?
              in the path).


       frag   The fragment of the document (i.e. what is found after the # in
              the path).


       user   If authentication information is set, the user name.


       passwd If authentication information is set, the password.


FUNCTIONS
       uri_t* uri_alloc_1()
              Allocate an empty object that must be filled with the
              uri_realloc function.


       uri_t* uri_alloc(char* uri, int uri_length)
              The uri is split into fields and the corresponding uri_t
              structure is returned. The structure is allocated using malloc.
              The URI is put in canonical form. If it cannot be put in
              canonical form, an error message is printed on stderr and a
              null pointer is returned.


       uri_t* uri_object(char* uri, int uri_length)
              The uri is split into fields and the corresponding uri_t
              structure is returned.  The returned structure is statically
              allocated and must not be freed.  The URI is put in canonical
              form. If it cannot be put in canonical form, an error message
              is printed on stderr and a null pointer is returned.


       int uri_realloc(uri_t* object, char* uri, int uri_length)
              The uri is split into fields in the previously allocated
              object structure. The URI is put in canonical form and
              URI_CANNONICAL is returned.  If it cannot be put in canonical
              form, nothing is done and URI_NOT_CANNONICAL is returned.


       void uri_free(uri_t* object)
              The object previously allocated by uri_alloc is deallocated.


       uri_t* uri_abs(uri_t* base, char* relative_string, int relative_length)
              Transform the relative URI relative_string into an absolute URI
              using base as the base URI. The returned uri_t object is
              allocated statically and must not be freed.


       uri_t* uri_abs_1(uri_t* base, uri_t* relative)
              Transform the relative URI relative into an absolute URI using
              base as the base URI. The returned uri_t object is allocated
              statically and must not be freed.


       int uri_info(uri_t* object)
              returns the content of the info field.


       char* uri_scheme(uri_t* object)
              returns the content of the scheme field.


       char* uri_host(uri_t* object)
              returns the content of the host field.


       char* uri_port(uri_t* object)
              returns the value of the port field of the object.  If the port
              field is empty, returns the default port for the corresponding
              scheme.  For instance, if the scheme is http, the string 80 is
              returned.  The returned string is statically allocated and must
              not be freed.


       char* uri_path(uri_t* object)
              returns the content of the path field.


       char* uri_params(uri_t* object)
              returns the content of the params field.


       char* uri_query(uri_t* object)
              returns the content of the query field.


       char* uri_frag(uri_t* object)
              returns the content of the frag field.


       char* uri_user(uri_t* object)
              returns the content of the user field.


       char* uri_passwd(uri_t* object)
              returns the content of the passwd field.


       char* uri_netloc(uri_t* object)
              returns a concatenation of the host and port fields, separated
              by a :.  If the host field is not set, the null pointer is
              returned and a message is printed on stderr.  The returned
              string is statically allocated and must not be freed.


       char* uri_auth_netloc(uri_t* object)
              returns a concatenation of the host and port fields, separated
              by a :.  If the user field is set, the user and passwd fields
              are prepended to the netloc, separated by a @.  If the host
              field is not set, the null pointer is returned and the error
              condition is set.  The returned string is statically allocated
              and must not be freed.


       char* uri_auth(uri_t* object)
              returns a concatenation of the user and passwd fields,
              separated by a :, or an empty string if either of them is not
              set.  The returned string is statically allocated and must not
              be freed.


       char* uri_all_path(uri_t* object)
              returns a concatenation of the path, params and query fields in
              the form /path;params?query. Note that a leading slash is only
              prepended to the returned value if the object is not a relative
              URI.  The returned string is statically allocated and must not
              be freed.


       void uri_info_set(uri_t* object, int value)
              set the info field to value.


       void uri_scheme_set(uri_t* object, char* value)
              set the scheme field to value. The URI_INFO_RELATIVE flag is
              updated according to the new value.


       void uri_host_set(uri_t* object, char* value)
              set the host field to value. The URI_INFO_RELATIVE flag is
              updated according to the new value.


       void uri_params_set(uri_t* object, char* value)
              set the params field to value.


       void uri_query_set(uri_t* object, char* value)
              set the query field to value.


       void uri_user_set(uri_t* object, char* value)
              set the user field to value.


       void uri_passwd_set(uri_t* object, char* value)
              set the passwd field to value.


       void uri_copy(uri_t* to, uri_t* from)
              copy the content of object from into object to.


       uri_t* uri_clone(uri_t* from)
              creates a new object containing the same data as from.  The
              returned object must be freed using uri_free.


       void uri_clear(uri_t* object)
              clear all information contained in object.


       void uri_set_root(const char* root)
              Set the path that uri_furi will prepend to the FURI. By default
              it is the empty string.


       const char* uri_get_root()
              Get the path set by uri_set_root, or the empty string if none
              was set.


       char* uri_furi(uri_t* object)
              returns a string containing the FURI (File equivalent of an URI)
              built from object.  The returned string is statically allocated
              and must not be freed.


       char* uri_uri(uri_t* object)
              returns a string containing the URI built from object.  The
              returned string is statically allocated and must not be freed.


       void uri_string(uri_t* object, char** stringp, int* string_sizep, int
       flags)
              Build a string representation of object in stringp according to
              flags.  Possible values of flags are described in the
              uri_cannonicalize_string function.  Upon return, *stringp
              points to a buffer of *string_sizep bytes allocated with
              malloc. If *stringp is not null on entry, it must point to a
              buffer allocated with malloc, which is reallocated to fit the
              needs of the string conversion. This function is the backend of
              all object to string translation functions.


       char* uri_escape(char* string, char* range)
              return a statically allocated copy of string with all characters
              found in the range string transformed into escaped form (%xx).
              A few examples of the range argument are defined:
              URI_ESCAPE_RESERVED, URI_ESCAPE_PATH, URI_ESCAPE_QUERY, and
              uri_escape_unsafe.


       char* uri_unescape(char* string)
              return a statically allocated copy of string with all escape
              sequences (%xx) transformed to characters.


       char* uri_cannonicalize_string(char* uri, int uri_length, int flag)
              returns the canonical form of the uri given as argument. The
              canonical form is formatted according to the value of flag.
              Values of flag are bits that can be OR'ed together.

              URI_STRING_FURI_STYLE
                     return a FURI.

              URI_STRING_URI_STYLE
                     return a URI.

              URI_STRING_ROBOTS_STYLE
                     return the corresponding robots.txt URI.

              URI_STRING_URI_NOHASH_STYLE
                     do not include the frag in the returned string.

              Returns 0 if uri is malformed.


       uri_t* uri_cannonical(uri_t* object)
              returns an object containing the canonical form of object.  If
              the URI_MODE_CANNONICAL flag is set, the object itself is
              returned.


       int uri_consistent(uri_t* object)
              Returns 0 if object contains an unparsable URL, and a non-zero
              value if object contains a well-formed URL. Must be called
              after a set of field changes to reset the flags and ensure
              that the modified URL is well formed.
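
       The escaping performed by uri_escape and uri_unescape can be
       illustrated by a stand-alone sketch. It is not the library's code:
       the names demo_escape, demo_unescape and demo_hex_value are
       hypothetical, the sketch writes into a caller-supplied buffer rather
       than a statically allocated one, and it assumes upper case
       hexadecimal digits in the %xx form.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy string into out, replacing every character
 * found in range by its %xx form.
 * Returns 0 on success, -1 if out is too small.
 */
int demo_escape(const char* string, const char* range, char* out, size_t out_size)
{
    assert(string != NULL && range != NULL && out != NULL);

    if (out_size == 0)
        return -1;

    size_t used = 0;
    for (; *string != '\0'; string++) {
        unsigned char c = (unsigned char)*string;
        if (strchr(range, c) != NULL) {
            if (used + 4 > out_size)        /* 3 bytes + terminating NUL */
                return -1;
            snprintf(out + used, 4, "%%%02X", (unsigned)c);
            used += 3;
        } else {
            if (used + 2 > out_size)        /* 1 byte + terminating NUL */
                return -1;
            out[used++] = (char)c;
        }
    }
    out[used] = '\0';
    return 0;
}

/* Value of a hexadecimal digit, or -1 if c is not one. */
static int demo_hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: transform valid %xx sequences back to characters,
 * in place. Malformed sequences are copied unchanged.
 */
void demo_unescape(char* string)
{
    assert(string != NULL);

    char* out = string;
    while (*string != '\0') {
        int hi, lo;
        if (string[0] == '%' &&
            (hi = demo_hex_value((unsigned char)string[1])) >= 0 &&
            (lo = demo_hex_value((unsigned char)string[2])) >= 0) {
            *out++ = (char)(hi * 16 + lo);
            string += 3;
        } else {
            *out++ = *string++;
        }
    }
    *out = '\0';
}
```

       Escaping abc"def.html with a range containing only the double quote
       gives abc%22def.html, and unescaping that result restores the
       original string.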


HTTP FUNCTIONS
       char* uri_robots(uri_t* object)
              returns a string containing the URI of the robots.txt file
              corresponding to the URI contained in object. For instance, if
              the URI contained in object is
              http://www.foo.com/dir/dir/file.html the returned string will be
              http://www.foo.com/robots.txt.  The returned string is
              statically allocated and must not be freed.
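
       The transformation performed by uri_robots can be approximated on
       plain strings, as in the sketch below. The function demo_robots is
       hypothetical: it works on a caller-supplied buffer, does not check
       that the scheme is http, and does not put the URI in canonical
       form, all of which the library may handle differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: keep the scheme://netloc part of uri and append
 * /robots.txt to it.
 * Returns 0 on success, -1 if uri has no scheme://host part or out is
 * too small.
 */
int demo_robots(const char* uri, char* out, size_t out_size)
{
    assert(uri != NULL && out != NULL);

    const char* sep = strstr(uri, "://");
    if (sep == NULL)
        return -1;                          /* relative URI */

    const char* netloc = sep + 3;
    size_t netloc_len = strcspn(netloc, "/?#");
    if (netloc_len == 0)
        return -1;                          /* no host */

    size_t prefix = (size_t)(netloc - uri) + netloc_len;
    int n = snprintf(out, out_size, "%.*s/robots.txt", (int)prefix, uri);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

       For http://www.foo.com/dir/dir/file.html the sketch produces
       http://www.foo.com/robots.txt, as in the description above.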


CANONICAL FORM
       The canonical form of a URI is an arbitrary choice made to encode all
       the possible variations of the same URI in one string. For instance,
       http://www.foo.com/abc"def.html will be transformed to
       http://www.foo.com/abc%22def.html. Most of the transformations follow
       the instructions found in draft-fielding-uri-syntax-04, but some of
       them do not.

       Additionally, when the path of the URI contains dots and double dots,
       it is reduced. For instance http://www.foo.com/dir/.././file.html will
       be transformed to http://www.foo.com/file.html.
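
       The reduction of dots and double dots can be sketched as an in-place
       rewrite of the path component. The function demo_reduce_path below is
       hypothetical and only handles paths that start with a /; how the
       library treats relative paths or a .. that climbs above the root is
       not covered here (the sketch simply ignores such a ..).

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: remove "." and ".." segments from an absolute
 * path, in place. The result is never longer than the input.
 */
void demo_reduce_path(char* path)
{
    assert(path != NULL && path[0] == '/');

    char* out = path;           /* write position, never ahead of in */
    const char* in = path;      /* always points at a '/' or the end */

    while (*in != '\0') {
        const char* seg = in + 1;
        size_t len = strcspn(seg, "/");
        int last = (seg[len] == '\0');

        if (len == 1 && seg[0] == '.') {
            /* ".": drop the segment, keep a trailing slash if last. */
            if (last)
                *out++ = '/';
        } else if (len == 2 && seg[0] == '.' && seg[1] == '.') {
            /* "..": drop the previously written segment, if any. */
            while (out > path && *--out != '/')
                ;
            if (last)
                *out++ = '/';
        } else {
            /* Ordinary segment: copy "/segment" to the write position. */
            memmove(out, in, len + 1);
            out += len + 1;
        }
        in = seg + len;
    }
    *out = '\0';
}
```

       Applied to the path /dir/.././file.html of the example above, the
       sketch leaves /file.html in the buffer.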

       If the URI_MODE_CANNONICAL flag is set, the uri_t object always
       contains the canonical form of the URL. The original form is lost.

       If the URI_MODE_CANNONICAL flag is not set, the canonical form of the
       URI is stored in a separate object. The uri_t object contains the
       original form of the URI. This takes more memory but may be useful
       in some situations.


ERROR HANDLING
       When an error occurs (a URI cannot be canonicalized or parsed, for
       instance), the global variable uri_errstr contains the full text of the
       error message. The library functions never reset this variable when
       no error occurs, so it may still hold the message of an earlier error.

       Additionally, the error string is printed on the error channel
       (stderr) if the URI_MODE_ERROR_STDERR flag is set. This is the default.


STRICTNESS
       The draft describing URI syntax (draft-fielding-uri-syntax-04)
       specifies that a URI of the type http:g may be interpreted in two
       different ways. If the URI_MODE_URI_STRICT flag is set, the library
       interprets it as an absolute URI, otherwise as a relative URI.

       If the URI_MODE_URI_STRICT flag is not set, the
       URI_MODE_URI_STRICT_SCHEME flag may be set so that a relative URI
       containing a scheme is interpreted as an absolute URI only if the
       scheme is different from the scheme of the base URI.
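
       The decision described in this section can be summarized by the
       following sketch. It is an interpretation of the text above, not the
       library's code: demo_is_absolute and the DEMO_* flag values are
       hypothetical, schemes are compared case-sensitively, and a reference
       of the form scheme:// is assumed to be absolute in every mode.

```c
#include <assert.h>
#include <string.h>

#define DEMO_STRICT        0x1  /* hypothetical stand-in for URI_MODE_URI_STRICT */
#define DEMO_STRICT_SCHEME 0x2  /* hypothetical stand-in for URI_MODE_URI_STRICT_SCHEME */

/*
 * Hypothetical helper: return 1 if reference should be interpreted as an
 * absolute URI, 0 if it should be interpreted relative to a base URI
 * whose scheme is base_scheme.
 */
int demo_is_absolute(const char* reference, const char* base_scheme, int mode)
{
    assert(reference != NULL && base_scheme != NULL);

    /* A scheme is a non-empty prefix that ends at the first ':'. */
    size_t scheme_len = strcspn(reference, ":/?#");
    if (scheme_len == 0 || reference[scheme_len] != ':')
        return 0;                           /* no scheme: relative */

    /* Assumption: scheme:// is never ambiguous. */
    if (strncmp(reference + scheme_len, "://", 3) == 0)
        return 1;

    /* Ambiguous form such as http:g. */
    if (mode & DEMO_STRICT)
        return 1;
    if (mode & DEMO_STRICT_SCHEME) {
        int same = strlen(base_scheme) == scheme_len &&
                   strncmp(reference, base_scheme, scheme_len) == 0;
        return !same;
    }
    return 0;
}
```

       With a base URI whose scheme is http, http:g is absolute only when
       the strict flag is given, whereas ftp:g is also absolute when only
       the strict scheme flag is given.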


FURI
       It is sometimes convenient to convert a URI into a path name. Some
       functions of the uri library provide such a conversion (uri_furi for
       instance). These path names are called FURI (File equivalent of an URI)
       for short. Here is a description of the transformation.
        http://www.ina.fr:700/imagina/index.html#queau
          |    \____________/ \________________/\____/
          |          |              |               lost
          |          |              |
          |          |              |
         /           |              |
         |           |              |
         |           |              |
         |           |              |
        /            |              |
        |   /^^^^^^^^^^^^^\/^^^^^^^^^^^^^^^^\
       http/www.ina.fr:700/imagina/index.html
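
       The diagram can be read as a simple string rewrite: the :// after
       the scheme becomes a single /, the netloc and path are kept, and the
       frag is dropped. The sketch below does exactly that and nothing
       more. The function demo_furi is hypothetical; unlike uri_furi it
       neither puts the URI in canonical form nor prepends the path set by
       uri_set_root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the readable FURI of an absolute URI.
 * Returns 0 on success, -1 if uri has no "://" or out is too small.
 */
int demo_furi(const char* uri, char* out, size_t out_size)
{
    assert(uri != NULL && out != NULL);

    const char* sep = strstr(uri, "://");
    if (sep == NULL)
        return -1;

    size_t scheme_len = (size_t)(sep - uri);
    const char* rest = sep + 3;             /* netloc and path */
    size_t rest_len = strcspn(rest, "#");   /* the frag is lost */

    int n = snprintf(out, out_size, "%.*s/%.*s",
                     (int)scheme_len, uri, (int)rest_len, rest);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

       Given http://www.ina.fr:700/imagina/index.html#queau, the sketch
       yields http/www.ina.fr:700/imagina/index.html, matching the diagram.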


EXAMPLES
       Show the canonical form of a URI
       char* uri = "http://www.foo.com/";
       uri = uri_cannonicalize_string(uri, strlen(uri), URI_STRING_URI_STYLE);
       if(uri) printf("uri = %s\n", uri);

       Show the host and port of a URI (netloc)
       char* uri = "http://www.foo.com:7000/";
       /* The variable must not be named uri_object: that would hide the function. */
       uri_t* object = uri_object(uri, strlen(uri));
       if(object) printf("netloc = %s\n", uri_netloc(object));

       Change the query part of a URI and show it
       char* uri = "http://www.foo.com/cgi-bin/bar?param=1";
       uri_t* object = uri_object(uri, strlen(uri));
       if(object) {
            uri_query_set(object, "param=2");
            printf("uri = %s\n", uri_uri(object));
       }


ADDING NEW SCHEMES
       Add the name of the scheme in the SCHEMES file. If nothing else this
       will bind the scheme to a generic parser following the URI parsing
       rules.  If you want to define specific behaviour for this scheme, mimic
       the uri_scheme_http.c file and recompile. If gperf(1) complains because
       it has conflicts you'll have to play with the -k option in order to
       find a working range that does not conflict and takes as little space
       as possible.


AUTHOR
       Loic Dachary loic@senga.org

SEE ALSO
       draft-fielding-uri-syntax-04



                                     local                              uri(3)