uri

uri(3)                      Library Functions Manual                      uri(3)



NAME
       uri - a set of functions to manipulate URIs


DESCRIPTION
       To use the library, add #include <uri.h> to the source and link with
       -luri.

       uri is a library that analyses and transforms URIs. It is designed to be
       fast and to use as little memory as possible. The basic usage of this
       library is to transform a URI into a structure with one field for each
       component of the URI and vice versa.


LIBRARY MODE
       The library behaviour is controlled by the flags described below. The
       default set of flags is URI_MODE_CANNONICAL|URI_MODE_ERROR_STDERR.


       URI_MODE_CANNONICAL
              All objects store the URI in canonical form.


       URI_MODE_LOWER_SCHEME
              The scheme of the URI is always converted to lower case.


       URI_MODE_ERROR_STDERR
              If an error occurs, the error string is printed on the standard
              error channel (stderr).


       URI_MODE_FIELD_MALLOC
              Each field may have its own malloc'd space. When the caller sets
              a field, it can assume that the content of the field is saved in
              the object. If this flag is not set, the caller must make sure,
              when setting a field, that the memory containing the value of the
              field is not freed before the object is deallocated.


       URI_MODE_FURI_MD5
              Use an MD5 key calculated from the URL as a path name instead of
              the readable path name described in the FURI section below.  For
              example http://www.foo.com/ is transformed into the MD5 key
              33024cec6160eafbd2717e394b5bc201 and the corresponding FURI is
              33/02/4c/ec6160eafbd2717e394b5bc201.


       URI_MODE_URI_STRICT
              Behave in strict mode (see STRICTNESS below).


       URI_MODE_URI_STRICT_SCHEME
              Behave in strict mode (see STRICTNESS below).


       URI_MODE_FLAG_DEFAULT
              The default mode of the library.
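
       The directory layout produced under URI_MODE_FURI_MD5, as described
       above, can be illustrated with a small standalone sketch. The function
       md5_key_to_path is hypothetical and is not part of the uri library; it
       only assumes that the key is a 32-character hexadecimal string and that
       the first three pairs of digits become directory names, as in the
       example given for the flag.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the uri library: split a 32-character
 * hexadecimal MD5 key into the layout xx/xx/xx/<remaining 26 characters>.
 * Returns 0 on success, -1 if the key or the output buffer is unsuitable.
 */
int md5_key_to_path(const char *key, char *out, size_t out_size)
{
    size_t len = strlen(key);
    size_t i, o = 0;

    /* 32 hex digits + 3 separators + terminating NUL = 36 bytes. */
    if (len != 32 || out_size < 36)
        return -1;

    for (i = 0; i < len; i++) {
        /* Insert a '/' after each of the first three pairs of digits. */
        if (i == 2 || i == 4 || i == 6)
            out[o++] = '/';
        out[o++] = key[i];
    }
    out[o] = '\0';
    return 0;
}
```

       Called with the key 33024cec6160eafbd2717e394b5bc201, the sketch fills
       the buffer with 33/02/4c/ec6160eafbd2717e394b5bc201.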


STRUCTURE AND ALLOCATION
       The uri_t type is a structure describing the URI.  Access functions are
       provided and should be used to get the values of the fields and to set
       new values.  All the fields are character strings whose size is exactly
       the size of the string they contain. One can safely overwrite the values
       contained in the fields, as long as the replacement string is no longer
       than the original string. If the replacement string is longer, the
       caller must use a buffer of its own.

       If the flag URI_MODE_FIELD_MALLOC is not set, which is the default, the
       allocation policy for a uri_t object is minimal. When an object is
       allocated using uri_alloc, memory is allocated by the library to store
       the object. This memory is released when the object is freed using
       uri_free.  When a field is set, the pointer is stored in the object and
       no copy of the string is kept. It is the responsibility of the caller to
       make sure that the string lives as long as the object does. This policy
       is designed to avoid allocation as much as possible. For a program that
       operates on 50 000 URLs, only one malloc and a few realloc calls are
       necessary, instead of 50 000 malloc/free pairs multiplied by the number
       of fields of the structure.  The loop looks like this:
            /*
             * Allocate an empty object.
             */
            uri_t* uri = uri_alloc_1();

            for(i = 0; i < 50000; i++) {
               /*
                * Reuse the object for another URL. The object grows
                * only if needed, because the URL is larger than
                * any previously seen URL.
                */
               uri_realloc(uri, url[i], strlen(url[i]));
               /* ... do something with uri ... */
               /*
                * Print the URL on stdout.
                */
               printf("%s\n", uri_uri(uri));
            }

       If the flag URI_MODE_FIELD_MALLOC is set, each field has a separately
       allocated space, if necessary. The caller may assume that the object is
       always self-contained and does not depend on externally allocated
       strings. Each set function (uri_scheme_set, uri_host_set etc.) allocates
       the necessary space and duplicates the string given as argument.  The
       info field contains flags that record which fields contain a malloc'd
       space and which do not (URI_INFO_M_* flags). This information is only
       valid between two calls of the library functions. For instance
       uri_cannonicalize will reorganize allocated space. This policy is used
       for integration of the library into scripting languages such as Perl.
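
       The difference between the two allocation policies described above can
       be sketched with a one-field object. Everything in this sketch
       (demo_uri_t, demo_host_set_borrowed, demo_host_set_copied, demo_clear)
       is hypothetical and only illustrates the ownership rules; it is not the
       library's implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-field object used only to illustrate the two policies. */
typedef struct {
    char *host;      /* current value of the field */
    int host_owned;  /* non-zero if host was allocated with malloc */
} demo_uri_t;

/* Default policy: store the caller's pointer, no copy is made. */
void demo_host_set_borrowed(demo_uri_t *u, char *value)
{
    if (u->host_owned)
        free(u->host);
    u->host = value;
    u->host_owned = 0;
}

/* Field-malloc policy: duplicate the string. Returns 0 on success. */
int demo_host_set_copied(demo_uri_t *u, const char *value)
{
    size_t size = strlen(value) + 1;
    char *copy = malloc(size);

    if (copy == NULL)
        return -1;
    memcpy(copy, value, size);
    if (u->host_owned)
        free(u->host);
    u->host = copy;
    u->host_owned = 1;
    return 0;
}

/* Release the field if the object owns it. */
void demo_clear(demo_uri_t *u)
{
    if (u->host_owned)
        free(u->host);
    u->host = NULL;
    u->host_owned = 0;
}
```

       With the borrowed policy, a later change to the caller's buffer is
       visible through the object, which is why the buffer must outlive it.
       With the copied policy the object keeps its own value, at the cost of
       one allocation per field.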


       info   A bit field carrying information about the URI. Each bit has a
              corresponding define with the following meaning.


       URI_INFO_CANNONICAL Set if the URI is in canonical form.


       URI_INFO_RELATIVE Set if the URI is a relative URI (does not start with
       {http,..}://).


       URI_INFO_RELATIVE_PATH Set if the URI is a relative URI and the path does
       not start with a /.


       URI_INFO_PARSED Set if the URI was successfully parsed. If this flag is
       not set the content of the object is undefined.


       URI_INFO_ROBOTS Set if the URI is an http robots.txt file.


       URI_INFO_M_* There is one such flag for each field of the uri_t
       structure.  If the flag is set, the memory pointed to by this field has
       been allocated by malloc.


       scheme The scheme of the URI (http, ftp, file or news).


       host   The host name part of the URI.


       port   The port number associated with host, if any.


       path   The path name of the URI.


       params The parameters of the URI (i.e. what is found after the ; in the
              path).


       query  The query part of a cgi-bin call (i.e. what is found after the ?
              in the path).


       frag   The fragment of the document (i.e. what is found after the # in
              the path).


       user   If authentication information is set, the user name.


       passwd If authentication information is set, the password.
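
       The way the params, query and frag fields relate to the ;, ? and #
       delimiters described above can be sketched as follows. The names
       demo_parts_t, demo_cut and demo_split are hypothetical, and the sketch
       only handles the part of the URI that follows the host; the library's
       own parser may proceed differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the four trailing components. */
typedef struct {
    char *path;
    char *params;  /* NULL if there is no ';' */
    char *query;   /* NULL if there is no '?' */
    char *frag;    /* NULL if there is no '#' */
} demo_parts_t;

/* Cut s at the first occurrence of c; return the text after c or NULL. */
static char *demo_cut(char *s, int c)
{
    char *p = strchr(s, c);

    if (p == NULL)
        return NULL;
    *p = '\0';
    return p + 1;
}

/*
 * Split "path;params?query#frag" in place. The buffer is modified and the
 * returned pointers reference it, so it must outlive parts.
 */
void demo_split(char *s, demo_parts_t *parts)
{
    /* The fragment is removed first because it may contain '?' or ';'. */
    parts->frag = demo_cut(s, '#');
    parts->query = demo_cut(s, '?');
    parts->params = demo_cut(s, ';');
    parts->path = s;
}
```

       Storing pointers into the caller's buffer instead of copying mirrors
       the default allocation policy described earlier in this section.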


FUNCTIONS
       uri_t* uri_alloc_1()
              Allocate an empty object that must be filled with the uri_realloc
              function.


       uri_t* uri_alloc(char* uri, int uri_length)
              The uri is split into fields and the corresponding uri_t
              structure is returned. The structure is allocated using malloc.
              The URI is put in canonical form. If it cannot be put in
              canonical form, an error message is printed on stderr and a null
              pointer is returned.


       uri_t* uri_object(char* uri, int uri_length)
              The uri is split into fields and the corresponding uri_t
              structure is returned.  The returned structure is statically
              allocated and must not be freed.  The URI is put in canonical
              form. If it cannot be put in canonical form, an error message is
              printed on stderr and a null pointer is returned.


       int uri_realloc(uri_t* object, char* uri, int uri_length)
              The uri is split into fields in the previously allocated object
              structure. The URI is put in canonical form and URI_CANNONICAL is
              returned.  If it cannot be put in canonical form, nothing is done
              and URI_NOT_CANNONICAL is returned.


       void uri_free(uri_t* object)
              The object previously allocated by uri_alloc is deallocated.


       uri_t* uri_abs(uri_t* base, char* relative_string, int relative_length)
              Transform the relative URI relative_string into an absolute URI
              using base as the base URI. The returned uri_t object is allocated
              statically and must not be freed.


       uri_t* uri_abs_1(uri_t* base, uri_t* relative)
              Transform the relative URI relative into an absolute URI using
              base as the base URI. The returned uri_t object is allocated
              statically and must not be freed.


       int uri_info(uri_t* object)
              returns the content of the info field.


       char* uri_scheme(uri_t* object)
              returns the content of the scheme field.


       char* uri_host(uri_t* object)
              returns the content of the host field.


       char* uri_port(uri_t* object)
              returns the value of the port field of the object.  If the port
              field is empty, returns the default port for the corresponding
              scheme.  For instance, if the scheme is http the 80 string is
              returned.  The returned string is statically allocated and must
              not be freed.


       char* uri_path(uri_t* object)
              returns the content of the path field.


       char* uri_params(uri_t* object)
              returns the content of the params field.


       char* uri_query(uri_t* object)
              returns the content of the query field.


       char* uri_frag(uri_t* object)
              returns the content of the frag field.


       char* uri_user(uri_t* object)
              returns the content of the user field.


       char* uri_passwd(uri_t* object)
              returns the content of the passwd field.


       char* uri_netloc(uri_t* object)
              returns a concatenation of the host and port fields, separated by
              a :.  If the host field is not set, a null pointer is returned
              and a message is printed on stderr.  The returned string is
              statically allocated and must not be freed.


       char* uri_auth_netloc(uri_t* object)
              returns a concatenation of the host and port fields, separated by
              a :.  If the user field is set, the user and passwd fields are
              prepended to the netloc, separated by a @.  If the host field is
              not set, a null pointer is returned and an error condition is
              set.  The returned string is statically allocated and must not be
              freed.


       char* uri_auth(uri_t* object)
              returns a concatenation of the user and passwd field, separated by
              a : or an empty string if any of them is not set.  The returned
              string is statically allocated and must not be freed.


       char* uri_all_path(uri_t* object)
              returns a concatenation of the path, params and query fields in
              the form /path;params?query. Note that a leading slash is only
              prepended to the returned value if the object is not a relative
              URI.  The returned string is statically allocated and must not be
              freed.


       void uri_info_set(uri_t* object, int value)
              set the info field to value.


       void uri_scheme_set(uri_t* object, char* value)
              set the scheme field to value. The URI_INFO_RELATIVE flag is
              updated according to the new value.


       void uri_host_set(uri_t* object, char* value)
              set the host field to value. The URI_INFO_RELATIVE flag is
              updated according to the new value.


       void uri_params_set(uri_t* object, char* value)
              set the params field to value.


       void uri_query_set(uri_t* object, char* value)
              set the query field to value.


       void uri_user_set(uri_t* object, char* value)
              set the user field to value.


       void uri_passwd_set(uri_t* object, char* value)
              set the passwd field to value.


       void uri_copy(uri_t* to, uri_t* from)
              copy the content of object from into object to.


       uri_t* uri_clone(uri_t* from)
              creates a new object containing the same data as from.  The
              returned object must be freed using uri_free.


       void uri_clear(uri_t* object)
              clear all information contained in object.


       void uri_set_root(const char* root)
              Set the path that uri_furi will prepend to the FURI. By default it
              is the empty string.


       const char* uri_get_root()
              Get the path set by uri_set_root or empty string.


       char* uri_furi(uri_t* object)
              returns a string containing the FURI (File equivalent of an URI)
              built from object.  The returned string is statically allocated
              and must not be freed.


       char* uri_uri(uri_t* object)
              returns a string containing the URI built from object.  The
              returned string is statically allocated and must not be freed.


       void uri_string(uri_t* object, char** stringp, int* string_sizep, int
       flags)
              Build a string representation of object in stringp according to
              flags.  Possible values of flags are described under the
              uri_cannonicalize_string function.  Upon return, *stringp points
              to a buffer of *string_sizep bytes allocated with malloc. If
              *stringp is not null on entry, it must point to a buffer
              allocated with malloc, which is reallocated to fit the needs of
              the string conversion. This function is the backend of all object
              to string translation functions.


       char* uri_escape(char* string, char* range)
              return a statically allocated copy of string with all characters
              found in the range string converted to escaped form (%xx).  A few
              examples of range argument are defined: URI_ESCAPE_RESERVED,
              URI_ESCAPE_PATH, URI_ESCAPE_QUERY, and uri_escape_unsafe.


       char* uri_unescape(char* string)
              return a statically allocated copy of string with all escape
              sequences (%xx) transformed to characters.
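
       The escaping rule used by uri_escape, where every character found in
       range is replaced by %xx, can be sketched as below. demo_escape is a
       hypothetical stand-in that writes into a caller-supplied buffer instead
       of a static one, and the choice of upper-case hexadecimal digits is an
       assumption of this sketch, not a statement about the library.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the idea behind uri_escape: copy string to out,
 * replacing every character found in range with %XX (two hex digits).
 * Returns 0 on success, -1 if out is too small.
 */
int demo_escape(const char *string, const char *range, char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (; *string != '\0'; string++) {
        unsigned char c = (unsigned char)*string;

        if (strchr(range, c) != NULL) {
            /* Need room for '%', two digits and the final NUL. */
            if (o + 3 >= out_size)
                return -1;
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        } else {
            /* Need room for the character and the final NUL. */
            if (o + 1 >= out_size)
                return -1;
            out[o++] = (char)c;
        }
    }
    if (o >= out_size)
        return -1;
    out[o] = '\0';
    return 0;
}
```

       With a range containing only the double quote, abc"def.html becomes
       abc%22def.html.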


       char* uri_cannonicalize_string(char* uri, int uri_length, int flag)
              returns the canonical form of the uri given as argument. The
              canonical form is formatted according to the value of flag.
              Values of flag are bits that can be OR'ed together:

              URI_STRING_FURI_STYLE
                     return a FURI.

              URI_STRING_URI_STYLE
                     return a URI.

              URI_STRING_ROBOTS_STYLE
                     return the corresponding robots.txt URI.

              URI_STRING_URI_NOHASH_STYLE
                     do not include the frag in the returned string.

              Returns a null pointer if uri is malformed.


       uri_t* uri_cannonical(uri_t* object)
              returns an object containing the canonical form of object.  If
              the URI_MODE_CANNONICAL flag is set, the object itself is
              returned.


       int uri_consistent(uri_t* object)
              Returns 0 if object contains an unparsable URL, returns != 0 if
              object contains a well-formed URL. Must be called after a set of
              field changes to reset the flags and to ensure that the modified
              URL is well formed.


HTTP FUNCTIONS
       char* uri_robots(uri_t* object)
              returns a string containing the URI of the robots.txt file
              corresponding to the URI contained in object. For instance, if the
              URI contained in object is http://www.foo.com/dir/dir/file.html
              the returned string will be http://www.foo.com/robots.txt.  The
              returned string is statically allocated and must not be freed.
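
       The transformation performed by uri_robots can be pictured as keeping
       the scheme and netloc and replacing everything else with /robots.txt.
       The function demo_robots below is a hypothetical, string-based sketch
       of that idea under the assumption that the input contains "://"; the
       library itself works from the fields of the uri_t object.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: keep "scheme://netloc" from uri and append
 * "/robots.txt". Returns 0 on success, -1 if uri has no "://" or out is
 * too small.
 */
int demo_robots(const char *uri, char *out, size_t out_size)
{
    static const char suffix[] = "/robots.txt";
    const char *netloc = strstr(uri, "://");
    size_t prefix_len;

    if (netloc == NULL)
        return -1;
    netloc += 3;
    /* The netloc ends at the first '/', '?' or '#', or at the end. */
    prefix_len = (size_t)(netloc - uri) + strcspn(netloc, "/?#");

    /* sizeof suffix includes the terminating NUL. */
    if (prefix_len + sizeof suffix > out_size)
        return -1;
    memcpy(out, uri, prefix_len);
    memcpy(out + prefix_len, suffix, sizeof suffix);
    return 0;
}
```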


CANONICAL FORM
       The canonical form of a URI is an arbitrary choice that encodes all the
       possible variations of the same URI as one string. For instance
       http://www.foo.com/abc"def.html will be transformed to
       http://www.foo.com/abc%22def.html. Most of the transformations follow the
       instructions found in draft-fielding-uri-syntax-04 but some of them
       do not.

       Additionally, when the path of the URI contains dots and double dots, it
       is reduced. For instance http://www.foo.com/dir/.././file.html will be
       transformed to http://www.foo.com/file.html.
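
       The reduction of dots and double dots described above can be sketched
       on the path alone. demo_reduce_path is a hypothetical helper that
       assumes an absolute path (starting with /) and drops a .. that would
       climb above the root; how the library treats that last case is not
       stated here, so it is an assumption of the sketch.

```c
#include <string.h>

/*
 * Hypothetical helper: reduce "." and ".." segments of an absolute path in
 * place. Paths that do not start with '/' are left untouched.
 */
void demo_reduce_path(char *path)
{
    char *in = path;
    char *out = path;

    if (path[0] != '/')
        return;

    while (*in != '\0') {
        size_t len;

        /* Skip the '/' that starts the segment. */
        if (*in == '/')
            in++;
        len = strcspn(in, "/");

        if (len == 1 && in[0] == '.') {
            /* "." : drop the segment. */
        } else if (len == 2 && in[0] == '.' && in[1] == '.') {
            /* ".." : remove the last segment already written, if any. */
            while (out > path && *--out != '/')
                ;
        } else {
            /* Ordinary segment: append "/segment" to the output. */
            *out++ = '/';
            memmove(out, in, len);
            out += len;
        }
        in += len;
    }
    /* Everything was removed: the result is the root. */
    if (out == path)
        *out++ = '/';
    *out = '\0';
}
```

       Applied to /dir/.././file.html, the sketch leaves /file.html in the
       buffer.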

       If the URI_MODE_CANNONICAL flag is set, the uri_t object always contains
       the canonical form of the URL. The original form is lost.

       If the URI_MODE_CANNONICAL flag is not set, the canonical form of the
       URI is stored in a separate object. The uri_t object contains the
       original form of the URI. This takes more memory but may be useful in
       some situations.


ERROR HANDLING
       When an error occurs (the URI cannot be canonicalized or parsed, for
       instance), the global variable uri_errstr contains the full text of the
       error message. The library never resets this variable when a call
       succeeds, so it may still hold the message of an earlier error.

       Additionally, the error string is printed on the standard error channel
       (stderr) if the URI_MODE_ERROR_STDERR flag is set. This is the default.


STRICTNESS
       The draft describing URI syntax (draft-fielding-uri-syntax-04) specifies
       that a URI of the type http:g may be interpreted in two different ways.
       If the URI_MODE_URI_STRICT flag is set, the library interprets it as an
       absolute URI, otherwise as a relative URI.

       If the URI_MODE_URI_STRICT flag is not set, the
       URI_MODE_URI_STRICT_SCHEME flag may be set so that a relative URI
       containing a scheme is interpreted as an absolute URI only if the scheme
       is different from the scheme of the base URI.
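
       The decision described above can be written down as a small function.
       The flag values DEMO_STRICT and DEMO_STRICT_SCHEME and the function
       demo_is_absolute are hypothetical; the real URI_MODE_* constants are
       defined in uri.h and their values are not assumed here. The sketch also
       assumes that both schemes have already been converted to lower case.

```c
#include <string.h>

/* Hypothetical flag values standing in for the URI_MODE_* constants. */
enum {
    DEMO_STRICT        = 1 << 0,
    DEMO_STRICT_SCHEME = 1 << 1
};

/*
 * Decide whether a reference such as "http:g" (a scheme, but no "//") is
 * treated as absolute. scheme is the scheme of the reference, base_scheme
 * the scheme of the base URI. Returns 1 for absolute, 0 for relative.
 */
int demo_is_absolute(const char *scheme, const char *base_scheme, int mode)
{
    /* Strict mode: a scheme always makes the URI absolute. */
    if (mode & DEMO_STRICT)
        return 1;
    /* Strict-scheme mode: absolute only if the scheme differs. */
    if (mode & DEMO_STRICT_SCHEME)
        return strcmp(scheme, base_scheme) != 0;
    /* Neither flag: treated as relative. */
    return 0;
}
```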


FURI
       It is sometimes convenient to convert a URI into a path name. Some
       functions of the uri library provide such a conversion (uri_furi for
       instance). These path names are called FURI (File equivalent of a URI)
       for short. Here is a description of the transformation.
        http://www.ina.fr:700/imagina/index.html#queau
          |    \____________/ \________________/\____/
          |          |              |               lost
          |          |              |
          |          |              |
         /           |              |
         |           |              |
         |           |              |
         |           |              |
        /            |              |
        |   /^^^^^^^^^^^^^\/^^^^^^^^^^^^^^^^\
       http/www.ina.fr:700/imagina/index.html
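
       The diagram above amounts to replacing :// with a single / and dropping
       the fragment. A hypothetical string-based sketch of that mapping
       follows; demo_furi is not the library's uri_furi, it ignores the root
       set with uri_set_root, and it does not cover the URI_MODE_FURI_MD5
       variant.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: build "scheme/netloc/path" from
 * "scheme://netloc/path#frag". Returns 0 on success, -1 if uri has no
 * "://" or out is too small.
 */
int demo_furi(const char *uri, char *out, size_t out_size)
{
    const char *sep = strstr(uri, "://");
    size_t scheme_len, rest_len;

    if (sep == NULL)
        return -1;
    scheme_len = (size_t)(sep - uri);
    /* Everything after "://" up to the fragment, which is lost. */
    rest_len = strcspn(sep + 3, "#");

    /* scheme + '/' + rest + terminating NUL must fit. */
    if (scheme_len + 1 + rest_len + 1 > out_size)
        return -1;
    memcpy(out, uri, scheme_len);
    out[scheme_len] = '/';
    memcpy(out + scheme_len + 1, sep + 3, rest_len);
    out[scheme_len + 1 + rest_len] = '\0';
    return 0;
}
```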


EXAMPLES
       Show the canonical form of a URI
       char* uri = "http://www.foo.com/";
       uri = uri_cannonicalize_string(uri, strlen(uri), URI_STRING_URI_STYLE);
       if(uri) printf("uri = %s\n", uri);

       Show the host and port of a URI (netloc)
       char* uri = "http://www.foo.com:7000/";
       /* The variable must not be named uri_object: it would hide the function. */
       uri_t* object = uri_object(uri, strlen(uri));
       if(object) printf("netloc = %s\n", uri_netloc(object));

       Change the query part of a URI and show it
       char* uri = "http://www.foo.com/cgi-bin/bar?param=1";
       uri_t* object = uri_object(uri, strlen(uri));
       if(object) {
            uri_query_set(object, "param=2");
            printf("uri = %s\n", uri_uri(object));
       }


ADDING NEW SCHEMES
       Add the name of the scheme to the SCHEMES file. With no further change,
       this binds the scheme to a generic parser following the URI parsing
       rules.  To define specific behaviour for this scheme, mimic the
       uri_scheme_http.c file and recompile. If gperf(1) complains about
       conflicts, experiment with the -k option to find a working range that
       does not conflict and takes as little space as possible.


AUTHOR
       Loic Dachary loic@senga.org

SEE ALSO
       draft-fielding-uri-syntax-04



                                      local                               uri(3)