afnix-txt

txt - standard text processing module

The Standard Text Processing module is an original
implementation of an object collection dedicated to text
processing. Although text scanning is the most common
operation performed in the field of text processing, the
module also provides specialized objects to store and index
text data. Text sorting and transliteration are also part of
this module.

     Scanning concepts
Text scanning is the ability to extract lexical elements, or
lexemes, from a stream. A scanner, or lexical analyzer, is
the principal object used to perform this task. A scanner is
built by adding special objects that act as pattern
matchers. When a pattern is matched, a special object called
a lexeme is returned.

     Pattern object
A Pattern object is a special object that acts as a model
for the string to match. There are several ways to build a
pattern. The simplest is with a regular expression; another
type of pattern is the balanced pattern. In its first form,
a pattern object is created with a regular expression
object.

# create a pattern object
const pat (afnix:txt:Pattern "$d+")

In this example, the pattern object is built to detect
integer objects.

pat:check "123" # true
pat:match "123" # 123

The check method returns true if the input string matches
the pattern. The match method returns the string that
matches the pattern. Since the pattern object can also
operate on a stream object, the match method is appropriate
for extracting a particular string. As usual, the pattern
object comes with an associated predicate.

afnix:txt:pattern-p pat # true

Another form of pattern object is the balanced pattern. A
balanced pattern is determined by a starting string and an
ending string. There are two types of balanced pattern: the
single balanced pattern and the recursive balanced pattern.
The single balanced pattern is appropriate for lexical
elements that are delimited by a character. For example, the
classical C-string is a single balanced pattern delimited by
the double quote character.

# create a balanced pattern
const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
pat:check "<xml>" # true
pat:match "<xml>" # xml

In the case of the C-string, the pattern might be more
appropriately defined with an additional escape character.
Such a character is used by the pattern matcher to accept
characters that would otherwise be read as part of the
pattern delimiters.

# create a balanced pattern
const pat (afnix:txt:Pattern "STRING" "'" '\\')
pat:check "'hello'" # true
pat:match "'hello'" # "hello"

In this form, a balanced pattern with an escape character is
created. The same string is used for both the starting and
ending string. Another constructor that takes two strings
can be used if the starting and ending strings are
different. The last pattern form is the recursive balanced
form. In this form, a starting and an ending string are used
to delimit the pattern, but the starting and ending strings
may also appear recursively inside it. In order to have an
exact match, the number of starting strings must equal the
number of ending strings. For example, the C-comment pattern
can be viewed as a recursive balanced pattern.

# create a c-comment pattern
const pat (afnix:txt:Pattern "STRING" "/*" "*/")
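
The counting rule behind the recursive balanced form can be
illustrated outside AFNIX. The Python sketch below is not
the module's implementation. It assumes that a match ends as
soon as the nesting depth returns to zero, and the function
name match_balanced is hypothetical.

```python
def match_balanced(text, start, end):
    """Return the balanced prefix of text (delimiters included), or None.

    Illustrative only: the depth goes up on each start string and
    down on each end string; the match ends when it returns to zero.
    Assumes start and end are different strings.
    """
    if not text.startswith(start):
        return None
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(start, i):
            depth += 1
            i += len(start)
        elif text.startswith(end, i):
            depth -= 1
            i += len(end)
            if depth == 0:
                return text[:i]
        else:
            i += 1
    return None  # more starting strings than ending strings


print(match_balanced("/* a /* b */ c */ tail", "/*", "*/"))
print(match_balanced("/* a /* b */", "/*", "*/"))
```

The first call returns the whole nested comment, while the
second returns None because one ending string is missing.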


     Lexeme object
The Lexeme object is the object built by a scanner that
contains the matched string. A lexeme is therefore a tagged
string. Additionally, a lexeme can carry extra information
such as a source name and index.

# create an empty lexeme
const lexm (afnix:txt:Lexeme)
afnix:txt:lexeme-p lexm # true

The default lexeme is created without any value. A value can
be set with the set-value method and retrieved with the
get-value method.

lexm:set-value "hello"
lexm:get-value # hello

The set-tag and get-tag methods work the same way but
operate on an integer. The source name and index are
likewise set and retrieved with corresponding methods.

# check for the source
lexm:set-source "world"
lexm:get-source # world

# check for the source index
lexm:set-index 2000
lexm:get-index # 2000


     Text scanning
Text scanning is the ability to extract lexical elements or
lexemes from an input stream. Generally, the lexemes are the
results of a matching operation which is defined by a
pattern object. A scanner is therefore defined by the
scanner object itself plus one or more pattern objects.

     Scanner construction
By default, a scanner is created without pattern objects.
The length method returns the number of pattern objects. As
usual, a predicate is associated with the scanner object.

# the default scanner
const scan (afnix:txt:Scanner)
afnix:txt:scanner-p scan # true
# the length method
scan:length # 0

The scanner construction proceeds by adding pattern objects.
Each pattern can be created independently, and later added
to the scanner. For example, a scanner that reads reals,
integers and strings can be defined as follows:

# create the scanner pattern
const REAL    (
  afnix:txt:Pattern "REAL"    [$d+.$d*])
const STRING  (
  afnix:txt:Pattern "STRING"  "\"" '\\')
const INTEGER (
  afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
# add the pattern to the scanner
scanner:add INTEGER REAL STRING

The order in which patterns are added defines the priority
at which a token is recognized. Binding each pattern to a
symbol is optional, since patterns can be created directly
as arguments. The named style, however, makes the scanner
definition easier to read.
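
As an illustration of pattern priority, here is a small
Python sketch of a scanner. It is not the AFNIX algorithm:
it assumes that the longest match wins and that ties are
broken by the order of addition, which is one common
convention. The function scan and the KEYWORD and NAME tags
are hypothetical, and the regular expressions use Python
syntax rather than AFNIX syntax.

```python
import re


def scan(text, patterns):
    """Return a list of (tag, value) lexemes found in text.

    patterns is a list of (tag, regex) pairs in order of addition.
    The longest match wins; on a tie, the earliest pattern wins.
    Scanning stops at the first position where nothing matches.
    """
    lexemes = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # blanks separate lexemes in this sketch
            pos += 1
            continue
        best = None
        for tag, regex in patterns:
            m = re.compile(regex).match(text, pos)
            # strict '>' keeps the earlier pattern when lengths are equal
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (tag, m)
        if best is None:
            break
        lexemes.append((best[0], best[1].group()))
        pos = best[1].end()
    return lexemes


PATTERNS = [
    ("INTEGER", r"0x[0-9a-fA-F]+|\d+"),
    ("REAL", r"\d+\.\d*"),
    ("STRING", r'"(?:\\.|[^"\\])*"'),
]

print(scan('42 1.5 0x1F "hi"', PATTERNS))
print(scan("if iffy", [("KEYWORD", r"if"), ("NAME", r"[a-z]+")]))
```

In the second call both patterns match "if" with the same
length, so the pattern added first supplies the tag.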

     Using the scanner
Once constructed, the scanner can be used as is. A stream is
generally the best way to operate. If the scanner reaches
the end-of-stream or cannot recognize a lexeme, the nil
object is returned. With a loop, it is easy to get all
lexemes.

while (trans valid (is:valid-p)) {
  # try to get the lexeme
  trans lexm (scanner:scan is)
  # check for nil lexeme and print the value
  if (not (nil-p lexm)) (println (lexm:get-value))
  # update the valid flag
  valid:= (and (is:valid-p) (not (nil-p lexm)))
}

In this loop, it is necessary first to check for the end of
the stream. This is done with the help of the special loop
construct that initializes the valid symbol. As soon as the
lexeme is built, it can be used. The lexeme holds the value
as well as its tag.

     Text sorting
Sorting is one of the primary functions implemented inside
the text processing module. There are three sorting
functions available in the module.

     Ascending and descending
The sort-ascent function operates with a vector object and
sorts the elements in ascending order. Any kind of object
can be sorted as long as it supports a comparison method.
The elements are sorted in place by using a quick sort
algorithm.

# create an unsorted vector
const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
# sort the vector in place
afnix:txt:sort-ascent v-i
# print the vector
for (e) (v-i) (println e)

The sort-descent function is similar to the sort-ascent
function except that the objects are sorted in descending
order.

     Lexical sorting
The sort-lexical function operates with a vector object and
sorts the elements in ascending order using a lexicographic
ordering relation. Objects in the vector must be literal
objects or an exception is raised.
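
The difference between the three orderings can be seen in a
short Python sketch. This is only an analogy: Python's
list.sort is not a quick sort, and comparing the string form
of each element is an assumption about what a lexicographic
ordering of literals looks like, not a description of the
AFNIX implementation.

```python
values = [7, 5, 3, 4, 1, 8, 0, 9, 2, 6]

values.sort()               # ascending order, in place
print(values)
values.sort(reverse=True)   # descending order, in place
print(values)

mixed = [10, 9, 2, 33]
mixed.sort(key=str)         # lexicographic: compare the string forms
print(mixed)
```

With the lexicographic key, 10 sorts before 2 because the
character "1" precedes "2".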

     Transliteration
Transliteration is the process of changing characters by
mapping one to another. The transliteration process
operates with a character source and produces a target
character with the help of a mapping table. The
transliteration process is not necessarily reversible as
often indicated in the literature.

     Literate object
The Literate object is a transliteration object that is
bound by default to the identity mapping. As usual, a
predicate is associated with the object.

# create a transliterate object
const tl (afnix:txt:Literate)
# check the object
afnix:txt:literate-p tl # true

The transliteration process can also operate with an escape
character in order to map a two-character sequence into a
single character, as is usually found in programming
languages.

# create a transliterate object by escape
const tl (afnix:txt:Literate '\\')


     Transliteration configuration
The set-map method configures the transliteration mapping
table while the set-escape-map method configures the escape
mapping table. The mapping is done by setting the source
character and the target character. For instance, to map the
tabulation character to a white space, the mapping table is
set as follows:

tl:set-map '\t' ' '

The escape mapping table operates the same way. It should be
noted that the mapping algorithm first translates the input
character, possibly yielding an escape character, and then
the escape mapping takes place. Note also that the
set-escape method can be used to set the escape character.

tl:set-map '\t' ' '


     Transliteration process
The transliteration process is done either with a string or
an input stream. In the first case, the translate method
operates with a string and returns a translated string. On
the other hand, the read method returns a character when
operating with a stream.

# set the mapping characters
tl:set-map 'h' 'w'
tl:set-map 'e' 'o'
tl:set-map 'l' 'r'
tl:set-map 'o' 'd'
# translate a string
tl:translate "helo" # word
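
The two-table mechanism can be sketched in Python as
follows. This is a toy model written for illustration, not
the AFNIX class: the method names only mirror those
described here, and looking up the character after an escape
directly in the escape table is an assumption.

```python
class Literate:
    """Toy transliterator: identity mapping unless a character is remapped."""

    def __init__(self, escape=None):
        self.escape = escape
        self.table = {}         # regular mapping table
        self.escape_table = {}  # applied to the character after an escape

    def set_map(self, src, dst):
        self.table[src] = dst

    def set_escape_map(self, src, dst):
        self.escape_table[src] = dst

    def translate(self, text):
        out = []
        chars = iter(text)
        for c in chars:
            c = self.table.get(c, c)       # regular mapping first
            if self.escape is not None and c == self.escape:
                nxt = next(chars, None)    # consume the escaped character
                if nxt is None:
                    out.append(c)          # trailing escape is kept as is
                else:
                    out.append(self.escape_table.get(nxt, nxt))
            else:
                out.append(c)
        return "".join(out)


tl = Literate("\\")
for src, dst in zip("helo", "word"):
    tl.set_map(src, dst)
tl.set_escape_map("n", "\n")
print(tl.translate("helo"))
print(repr(tl.translate("a\\nb")))
```

The first call prints word. The second shows a backslash
followed by n collapsing into a single newline character.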




     Pattern
The Pattern class is a pattern matching class based either
on a regular expression or on balanced strings. In the regex
mode, the pattern is defined with a regex and a match is
said to occur when a regex match is achieved. In the
balanced string mode, the pattern is defined with a start
string and an end string. The balanced mode can be single or
recursive. Additionally, an escape character can be
associated with the class. A name and a tag are also bound
to the pattern object to ease its integration within a
scanner.

     Predicate

     pattern‐p

     Inheritance

     Object

     Constructors

     Pattern (none)
     The Pattern constructor creates an empty pattern.

     Pattern (String|Regex)
     The Pattern constructor creates a pattern object
     associated with a regular expression. The argument can
     be either a string or a regular expression object. If
     the argument is a string, it is converted into a
     regular expression object.

     Pattern (String String)
     The Pattern constructor creates a balanced pattern. The
     first argument is the start pattern string. The second
     argument is the end balanced string.

     Pattern (String String Character)
     The Pattern constructor creates a balanced pattern with
     an escape character. The first argument is the start
     pattern string. The second argument is the end balanced
     string. The third argument is the escape character.

     StringPattern(String
     The Pattern constructor creates a recursive balanced
     pattern. The first argument is the start pattern
     string. The second argument is the end balanced string.

     Constants

     REGEX
     The REGEX constant indicates that the pattern is a
     regular expression.

     BALANCED
     The BALANCED constant indicates that the pattern is a
     balanced pattern.

     RECURSIVE
     The RECURSIVE constant indicates that the pattern is a
     recursive balanced pattern.

     Methods

     check -> Boolean
     The check method checks the pattern against the input
     string. If the verification is successful, the method
     returns true, false otherwise.

     match -> String
     The match method attempts to match an input string or
     an input stream. If a match occurs, the matching string
     is returned. If the input is a string, the end of
     string is used as an end condition. If the input is a
     stream, the end of stream is used as an end condition.

     set-tag -> none
     The set-tag method sets the pattern tag. The tag can be
     further used inside a scanner.

     get-tag -> Integer
     The get-tag method returns the pattern tag.

     set-name -> none
     The set-name method sets the pattern name. The name is
     a symbolic identifier for that pattern.

     get-name -> String
     The get-name method returns the pattern name.

     set-regex -> none
     The set-regex method sets the pattern regex either with
     a string or with a regex object. If the method is
     successfully completed, the pattern type is switched to
     the REGEX type.

     set-escape -> none
     The set-escape method sets the pattern escape
     character. The escape character is used only in
     balanced mode.

     get-escape -> Character
     The get-escape method returns the escape character.

     set-balanced -> none
     The set-balanced method sets the pattern balanced
     string. With one argument, the same balanced string is
     used for starting and ending. With two arguments, the
     first argument is the starting string and the second is
     the ending string.


     Lexeme
The Lexeme class is a literal object that is designed to
hold the string matched by a pattern. A lexeme consists of a
string (i.e. the lexeme value), a tag and optionally a
source name (e.g. a file name) and a source index (e.g. a
line number).

     Predicate

     lexeme‐p

     Inheritance

     Literal

     Constructors

     Lexeme (none)
     The Lexeme constructor creates an empty lexeme.

     Lexeme (String)
     The Lexeme constructor creates a lexeme by value. The
     string argument is the lexeme value.

     Methods

     set-tag -> none
     The set-tag method sets the lexeme tag. The tag can be
     further used inside a scanner.

     get-tag -> Integer
     The get-tag method returns the lexeme tag.

     set-value -> none
     The set-value method sets the lexeme value. The lexeme
     value is generally the result of a matching operation.

     get-value -> String
     The get-value method returns the lexeme value.

     set-index -> none
     The set-index method sets the lexeme source index. The
     lexeme source index can be, for instance, the source
     line number.

     get-index -> Integer
     The get-index method returns the lexeme source index.

     set-source -> none
     The set-source method sets the lexeme source name. The
     lexeme source name can be, for instance, the source
     file name.

     get-source -> String
     The get-source method returns the lexeme source name.


     Scanner
The Scanner class is a text scanner, or lexical analyzer,
that operates on an input stream and matches one or several
patterns. The scanner is built by adding patterns to the
scanner object. With an input stream, the scanner object
attempts to build a buffer that matches at least one
pattern. When such a match occurs, a lexeme is built. When
building a lexeme, the pattern tag is used to mark the
lexeme.

     Predicate

     scanner‐p

     Inheritance

     Object

     Constructors

     Scanner (none)
     The Scanner constructor creates an empty scanner.

     Methods

     add -> none
     The add method adds zero or more pattern objects to the
     scanner. The priority of a pattern is determined by the
     order in which the patterns are added.

     length -> Integer
     The length method returns the number of pattern objects
     in this scanner.

     get -> Pattern
     The get method returns a pattern object by index.

     check -> Lexeme
     The check method checks that a string is matched by the
     scanner and returns the associated lexeme.

     scan -> Lexeme
     The scan method scans an input stream until a pattern
     is matched. When a match occurs, the associated lexeme
     is returned.

     Literate
The Literate class is a transliteration mapping class.
Transliteration is the process of changing characters by
mapping one to another. The transliteration process operates
with a character source and produces a target character with
the help of a mapping table. This transliteration object can
also operate with an escape table. In the presence of an
escape character, an escape mapping table is used instead of
the regular one.

     Predicate

     literate‐p

     Inheritance

     Object

     Constructors

     Literate (none)
     The Literate constructor creates a default
     transliteration object.

     Literate (Character)
     The Literate constructor creates a default
     transliteration object with an escape character. The
     argument is the escape character.

     Methods

     read -> Character
     The read method reads a character from the input stream
     and translates it with the help of the mapping table. A
     second character might be consumed from the stream if
     the first character is an escape character.

     getu -> Character
     The getu method reads a Unicode character from the
     input stream and translates it with the help of the
     mapping table. A second character might be consumed
     from the stream if the first character is an escape
     character.

     reset -> none
     The reset method resets all the mapping tables and
     installs a default identity one.

     set-map -> none
     The set-map method sets the mapping table by using a
     source and target character. The first character is the
     source character. The second character is the target
     character.

     get-map -> Character
     The get-map method returns the character mapped to a
     given character. The source character is the argument.

     translate -> String
     The translate method translates a string by
     transliteration and returns a new string.

     set-escape -> none
     The set-escape method sets the escape character.

     get-escape -> Character
     The get-escape method returns the escape character.

     set-escape-map -> none
     The set-escape-map method sets the escape mapping table
     by using a source and target character. The first
     character is the source character. The second character
     is the target character.

     get-escape-map -> Character
     The get-escape-map method returns the escape character
     mapped to a given character. The source character is
     the argument.

     Functions

     sort-ascent -> none
     The sort-ascent function sorts the vector argument in
     ascending order. The vector is sorted in place.

     sort-descent -> none
     The sort-descent function sorts the vector argument in
     descending order. The vector is sorted in place.

     sort-lexical -> none
     The sort-lexical function sorts the vector argument in
     lexicographic order. The vector is sorted in place.