diff

diff(3m)                      MBA Library Functions                     diff(3m)



NAME
       diff - compute a shortest edit script (SES) given two sequences

SYNOPSIS
       #include <mba/diff.h>


       typedef const void *(*idx_fn)(const void *s, int idx, void *context);
       typedef int (*cmp_fn)(const void *e1, const void *e2, void *context);

       typedef enum {
            DIFF_MATCH = 1,
            DIFF_DELETE,
            DIFF_INSERT
       } diff_op;

       struct diff_edit {
            short op; /* DIFF_MATCH, DIFF_DELETE or DIFF_INSERT */
            int off; /* off into a if MATCH or DELETE, b if INSERT */
            int len;
       };

       int diff(const void *a,
                   int aoff,
                   int n,
                   const void *b,
                   int boff,
                   int m,
                   idx_fn idx,
                   cmp_fn cmp,
                   void *context,
                   int dmax,
                   struct varray *ses,
                   int *sn,
                   struct varray *buf);


DESCRIPTION
       The diff(3m) module will compute theshortest edit script(SES) of two
       sequences. This algorithm is perhaps best known as the one used in GNU
       diff(1) although GNU diff employs additional optimizations specific to
       line oriented input such as source code files whereas this implementation
       is more generic. Formally, this implementation of the SES solution uses
       the dynamic programming algorithm described by Myers [1] with the
       Hirschberg linear space refinement. The objective is to compute the
       minimum set of edit operations necessary to transform a sequence A of
       length N into B of length M. This can be performed in O(N+M*D^2) expected
       time where D is theedit distance(number of elements deleted and inserted
       to transform A into B). Thus the algorithm is particularly fast and uses
       less space if the difference between sequences is small. The interface is
       generic such that sequences can be anything that can be indexed and
       compared with user supplied functions including arrays of structures,
       linked lists, arrays of pointers to strings in a file, etc.

       [1] E. Myers, ``An O(ND) Difference Algorithm and Its Variations,''
       Algorithmica 1, 2 (1986), 251-266.
       http://www.cs.arizona.edu/people/gene/PAPERS/diff.ps

       diff   The diff(3m) function computes the minimumedit distancebetween the
              sequences a (from aoff for length n) and b (from boff for length
              m) and optionally collects theedit scriptnecessary to transform a
              into b storing the result in the varray identified by the ses
              parameter. The idx paremeter is a pointer to an idx_fn that
              returns the element at the numeric index in a sequence. The cmp
              parameter is a pointer to a cmp_fn function that returns zero if
              the two elements e1 and e2 are equal and non-zero otherwise. The
              supplied context parameter will be passed directly to both
              functions with each invokation. If idx and cmp are NULL a and b
              will be compared as continuous sequences of unsigned char (i.e.
              raw memory diff).

              If the ses parameter is not NULL it must be a varray(3m) with a
              membsize of sizeof(struct diff_edit).  Each struct diff_edit
              element in the varray(3m) starting from 0 will be populated with
              the op, off, and len that together constitute the edit script. The
              number of struct diff_edit elements in the edit script is written
              to the integer pointed to by the sn parameter. If the ses or sn
              parameter is NULL, the edit script will not be collected.

              If the dmax parameter is not 0, the calculation will stop as soon
              as it is determined that the edit distance of the two sequences
              equals or exceeds the specified value. A value of 0 indicates that
              there is no limit.

              If the buf parameter is not NULL it must be a varray(3m) with
              membsize of sizeof(int) and will be used as temporary storage for
              the dynamic programming tables. If buf is NULL storage will be
              temporarily allocated and freed with malloc(3) and free(3).

RETURNS
       diff   The diff(3m) function returns the edit distance between the two
              sequences a and b or dmax if the edit distance has been determined
              to meet or exceed the specified dmax value. If an error occurs -1
              is returned and errno is set to indicate the error.



libmba-0.9.1                      Apr 29, 2005                          diff(3m)