utf

UTF, Unicode, ASCII, rune − character set and format

The Plan 9 character set and representation are based on
Unicode and on a proposed X‐Open multibyte (File System Safe
Universal Character Set Transformation Format) encoding.
Unicode represents its characters in 16 bits; FSS‐UTF, or
just UTF, represents such values in an 8‐bit byte stream.

     In Plan 9, a rune is a 16‐bit quantity representing a
Unicode character.  Internally, programs may store
characters as runes.  However, any external manifestation of
textual information, in files or at the interface between
programs, uses a machine‐independent, byte‐stream encoding
called UTF.

     UTF is designed so that the 7‐bit ASCII set (values
hexadecimal 00 to 7F) appears only as itself in the
encoding.  Runes with values above 7F appear as sequences of
two or more bytes with values only from 80 to FF.
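
     One consequence of this property can be sketched in C.
Because bytes 00 to 7F never occur inside a multibyte
sequence, a plain byte search for a 7‐bit character cannot
match in the middle of another character.  The helper names
count_highbit and find_ascii are hypothetical, chosen for
this illustration; they are not part of any Plan 9 library.

```c
#include <stddef.h>

/*
 * Hypothetical helpers for illustration only.
 * Count the bytes in s[0..n) that have the high bit set.
 * A result of 0 means the text is pure 7-bit.
 */
size_t
count_highbit(const unsigned char *s, size_t n)
{
	size_t i, k;

	k = 0;
	for(i = 0; i < n; i++)
		if(s[i] >= 0x80)
			k++;
	return k;
}

/*
 * Return the index of the first byte equal to the 7-bit
 * character c, or -1 if there is none.  No decoding is
 * needed: bytes of multibyte sequences are all 80 to FF.
 */
long
find_ascii(const unsigned char *s, size_t n, unsigned char c)
{
	size_t i;

	for(i = 0; i < n; i++)
		if(s[i] == c)
			return (long)i;
	return -1;
}
```

     For instance, given the seven bytes 63 61 66 C3 A9 2F
78, a search for 2F (slash) finds index 5 without examining
the two‐byte sequence C3 A9 as a unit.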

     The UTF encoding of Unicode is backward compatible with
ASCII: programs presented only with ASCII work on Plan 9
even if not written to deal with UTF, as do programs that
deal with uninterpreted byte streams.  However, programs
that perform semantic processing on ASCII graphic characters
must convert from UTF to runes in order to work properly
with non‐ASCII input.  See rune(2).

     Letting numbers be binary, a rune x is converted to a
multibyte sequence as follows:

     01. x in [00000000.0bbbbbbb] → 0bbbbbbb
     10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
     11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb

     Conversion 01 provides a one‐byte sequence that spans
the ASCII character set in a compatible way.  Conversions 10 and
11 represent higher‐valued characters as sequences of two or
three bytes with the high bit set.  Plan 9 does not support
the 4, 5, and 6 byte sequences proposed by X‐Open.  When
there are multiple ways to encode a value, for example rune
0, the shortest encoding is used.
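
     The three conversions can be sketched in C as follows.
This is a minimal illustration, not the library routine: the
function name rune_to_utf is hypothetical, and uint16_t
stands in for the 16‐bit rune.  Testing the ranges in
increasing order selects the shortest encoding for each
value.

```c
#include <stdint.h>

/*
 * Hypothetical encoder for illustration only.
 * Write the encoding of rune r into buf, which must have room
 * for 3 bytes, and return the number of bytes written.
 */
int
rune_to_utf(unsigned char *buf, uint16_t r)
{
	if(r <= 0x7F){
		/* conversion 01: 0bbbbbbb */
		buf[0] = (unsigned char)r;
		return 1;
	}
	if(r <= 0x7FF){
		/* conversion 10: 110bbbbb 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	/* conversion 11: 1110bbbb 10bbbbbb 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | (r >> 12));
	buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (r & 0x3F));
	return 3;
}
```

     Under this sketch, rune 00E9 becomes the two bytes C3
A9, and rune 20AC becomes the three bytes E2 82 AC.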

     In the inverse mapping, any sequence except those
described above is incorrect and is converted to rune 0080.
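
     A decoder following this rule might look like the
sketch below.  The names utf_to_rune and BadRune are
hypothetical.  Two choices here are assumptions rather than
statements from this page: a bad sequence consumes exactly
one byte, and an encoding longer than the shortest one for
its value is treated as incorrect.

```c
#include <stddef.h>
#include <stdint.h>

enum { BadRune = 0x0080 };	/* rune produced for incorrect input */

/*
 * Hypothetical decoder for illustration only.
 * Decode one rune from s[0..n), n >= 1, into *r.
 * Return the number of bytes consumed, which is always >= 1.
 */
int
utf_to_rune(uint16_t *r, const unsigned char *s, size_t n)
{
	unsigned c0, c1, c2, v;

	c0 = s[0];
	if(c0 < 0x80){
		/* conversion 01: one byte stands for itself */
		*r = (uint16_t)c0;
		return 1;
	}
	if(n >= 2 && (s[1] & 0xC0) == 0x80){
		c1 = s[1] & 0x3F;
		if((c0 & 0xE0) == 0xC0){
			/* conversion 10: 110bbbbb 10bbbbbb */
			v = ((c0 & 0x1F) << 6) | c1;
			if(v >= 0x80){	/* assumed: reject non-shortest form */
				*r = (uint16_t)v;
				return 2;
			}
		}else if((c0 & 0xF0) == 0xE0 && n >= 3 && (s[2] & 0xC0) == 0x80){
			/* conversion 11: 1110bbbb 10bbbbbb 10bbbbbb */
			c2 = s[2] & 0x3F;
			v = ((c0 & 0x0F) << 12) | (c1 << 6) | c2;
			if(v >= 0x800){	/* assumed: reject non-shortest form */
				*r = (uint16_t)v;
				return 3;
			}
		}
	}
	/* anything else is incorrect; assumed to consume one byte */
	*r = BadRune;
	return 1;
}
```

     Under these assumptions, a stray byte 80, a truncated
sequence such as E2 82, and the overlong pair C0 80 each
decode to rune 0080 and advance the input by one byte.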