UBF(A)
UBF(A) - quick summary
UBF(A) is the transport format. It was designed to be easy to parse
and easy to write with a text editor. UBF(A) is based on a byte-encoded
virtual machine in which 26 byte codes are reserved. Instead of
allocating the byte codes from 0, we use the printable character codes
to make the format easy to read.
Primitive types
UBF(A) has four primitive types. When a primitive tag is recognized,
it is pushed onto the "recognition stack" in our decoder. The
primitive types are:
- Integers
- Integers are sequences of bytes which can be described by the
regular expression -?[0-9]+, that is, an optional minus (to
denote a negative integer) followed by a sequence of at least one digit.
No restrictions are made as to the precision of the integer; precision
issues will be dealt with in UBF(B).
- Strings
- Strings are written enclosed in double quotes, thus:
"...."
Within a string two quoting conventions are observed:
" must be written \" and \ must be written
\\. No other quotings are allowed (this is so that we can write a double quote within a string).
- Binary Data
- Uninterpreted blocks of binary data are encoded, thus:
Int ~....~
First an integer representing the length of the binary data is
encoded. This is followed by a tilde, the data itself, which must be exactly the
length given by the integer, and then a closing tilde. The closing tilde
has no significance and is retained for readability. White space can be
added between the integer length and the data for readability.
- Constants
- Constants are encoded as strings, only using a single quote
instead of a double quote.
Constants are commonly found in symbolic languages like Lisp, Prolog or
Erlang. In C they would be represented by hashed strings. The
essential property of a constant is that two constants can be compared for
equality in constant time. They are used for representing
symbolic constants.
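The four primitive encodings can be sketched in a few lines of Python. This is only an illustration of the rules above, not a reference encoder: the function names are hypothetical, and the use of UTF-8 for the text inside strings and constants is an assumption, since UBF(A) itself only speaks of bytes.

```python
def _quote(text, quote):
    """Escape backslash and the quote character, then wrap in quotes."""
    escaped = text.replace("\\", "\\\\").replace(quote, "\\" + quote)
    return quote + escaped + quote


def encode_integer(value):
    # An optional minus sign followed by at least one decimal digit.
    return str(value).encode("ascii")


def encode_string(text):
    # Strings use double quotes; only " and \ are escaped.
    # UTF-8 is an assumption of this sketch.
    return _quote(text, '"').encode("utf-8")


def encode_constant(name):
    # Constants follow the string rules, with single quotes instead.
    return _quote(name, "'").encode("utf-8")


def encode_binary(data):
    # Length, a blank for readability, then the raw bytes between tildes.
    return str(len(data)).encode("ascii") + b" ~" + data + b"~"
```

Under these assumptions, encode_binary(b"abc") yields the bytes 3 ~abc~ and encode_constant("ok") yields 'ok'.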
In addition, any item can be followed by a semantic tag. This is
written `...`; within the tag the closing quote is quoted as in the string encoding. This tag has no meaning in UBF(A) but might have a meaning in UBF(B). For example:
12456 ~....~ `jpg`
This represents 12456 bytes of raw data with the semantic tag "jpg".
UBF(A) does not know what "jpg" means; the tag is passed on to UBF(B),
which might know what it means. Finally, the end application is
expected to know what to do with an object of type "jpg"; it might,
for example, know that this represents an image. UBF(A) will just
encode the tag, UBF(B) will type check the tag, and the application
should be able to understand the tag.
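As a small illustration, a semantic tag could be attached to an already encoded item as follows. The helper name add_tag is hypothetical, the four sample bytes merely stand in for image data, and the blank before the tag is there for readability only.

```python
def add_tag(encoded_item, tag):
    """Append a semantic tag to an already encoded UBF(A) item."""
    # Inside the tag, backslash and the closing back quote are escaped,
    # following the same convention as strings.
    escaped = tag.replace("\\", "\\\\").replace("`", "\\`")
    return encoded_item + b" `" + escaped.encode("utf-8") + b"`"


# Four placeholder bytes of binary data, encoded as "Int ~....~".
payload = b"\xff\xd8\xff\xe0"
item = str(len(payload)).encode("ascii") + b" ~" + payload + b"~"
tagged = add_tag(item, "jpg")
```

Here UBF(A) only carries the tag along; deciding whether "jpg" is acceptable is left to UBF(B) and the application.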
Compound types
Having defined our four simple types, we define two types of "glue"
for making compound objects.
- Structs
- Structures are written:
{ Obj1 Obj2 ... Objn }
The byte codes for "{" and "}" are used to delimit a structure.
Obj1..Objn are arbitrary UBF(A) objects. The decoder and encoder
must map UBF(A) objects onto an appropriate representation in the
application programming language (for example structs in C, arrays
in Java, tuples in Erlang, etc.).
Structs are used to represent fixed numbers of objects.
- Lists
- Lists are used to represent variable numbers of objects. They are written with the syntax:
# ObjN & ObjN-1 & ... & Obj2 & Obj1 &
This represents a list of objects: the first object in the list is
Obj1, the second is Obj2, and so on. Note that the objects
are presented in reverse order. Lisp programmers will recognize #
as an operator that pushes NIL (or end of list) onto the recognition stack
and & as an operator that takes the top two items on the recognition stack and replaces them by a list cell.
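The reverse ordering is easiest to see in code. The sketch below is illustrative only: encode_struct, encode_list and build_list are hypothetical names, the elements passed to the encoders are assumed to be already encoded UBF(A) items, and Python lists stand in for list cells.

```python
def encode_struct(encoded_items):
    """Encode a fixed group of already encoded items as { Obj1 ... Objn }."""
    return b"{ " + b" ".join(encoded_items) + b" }"


def encode_list(encoded_items):
    """Encode a list whose elements are already UBF(A)-encoded."""
    out = b"#"
    # Emit the last element first: the decoder puts each new element at
    # the front of the list, so the original order is restored.
    for item in reversed(encoded_items):
        out += b" " + item + b" &"
    return out


def build_list(objects_in_wire_order):
    """Replay what '#' and '&' do on the recognition stack."""
    stack = [[]]                      # '#' pushes the empty list (NIL)
    for obj in objects_in_wire_order:
        stack.append(obj)             # the object is recognized and pushed
        head = stack.pop()            # '&' pops the object ...
        tail = stack.pop()            # ... and the list beneath it,
        stack.append([head] + tail)   # then pushes the new list cell
    return stack.pop()
```

For example, encode_list([b"1", b"2", b"3"]) produces # 3 & 2 & 1 &, and replaying the objects 3, 2, 1 through build_list gives back the list 1, 2, 3.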
Finally we need to know when an object has finished.
The operator $ signifies "end of object". When $ is encountered there should be only one item on the recognition stack.
White space
For convenience, blank, carriage return, line feed, tab and comma
are treated as white space. Comments can be included in UBF(A) with the
syntax %...%; the usual quoting convention applies.
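Putting the pieces described so far together, a decoder can be sketched as a small stack machine. This is a minimal sketch under stated assumptions, not the reference decoder: the names Constant, Tagged, STRUCT_START, read_quoted and decode are hypothetical, structs are mapped to Python tuples, a tag is assumed to apply to the item on top of the stack, text is assumed to be UTF-8, and error handling is deliberately thin.

```python
from collections import namedtuple

Constant = namedtuple("Constant", "name")   # hypothetical wrapper for '...'
Tagged = namedtuple("Tagged", "value tag")  # hypothetical wrapper for `...`
STRUCT_START = object()                     # marker pushed by '{'
DIGITS = b"0123456789"
QUOTES = b"\"'`%"                           # string, constant, tag, comment


def read_quoted(data, pos):
    """Read the quoted run opening at data[pos]; return (text, next_pos)."""
    quote, out, pos = data[pos], bytearray(), pos + 1
    while data[pos] != quote:
        if data[pos] == ord("\\"):          # escaped quote or backslash
            pos += 1
        out.append(data[pos])
        pos += 1
    return out.decode("utf-8"), pos + 1


def decode(data):
    """Decode a single UBF(A) object, terminated by '$', from bytes."""
    stack, pos = [], 0
    while pos < len(data):
        byte = data[pos]
        if byte in b" \r\n\t,":             # white space, including comma
            pos += 1
        elif byte in QUOTES:
            text, pos = read_quoted(data, pos)
            if byte == ord('"'):
                stack.append(text)
            elif byte == ord("'"):
                stack.append(Constant(text))
            elif byte == ord("`"):          # tag applies to the item on top
                stack.append(Tagged(stack.pop(), text))
            # otherwise it was a %...% comment, which is discarded
        elif byte == ord("-") or byte in DIGITS:
            end = pos + 1
            while end < len(data) and data[end] in DIGITS:
                end += 1
            stack.append(int(data[pos:end].decode("ascii")))
            pos = end
        elif byte == ord("~"):              # the length is already on the stack
            length = stack.pop()
            end = pos + 1 + length
            if data[end:end + 1] != b"~":
                raise ValueError("binary data not closed by '~'")
            stack.append(data[pos + 1:end])
            pos = end + 1
        elif byte == ord("{"):
            stack.append(STRUCT_START)
            pos += 1
        elif byte == ord("}"):
            items = []
            while stack[-1] is not STRUCT_START:
                items.append(stack.pop())
            stack.pop()                     # drop the marker
            stack.append(tuple(reversed(items)))
            pos += 1
        elif byte == ord("#"):              # push the empty list (NIL)
            stack.append([])
            pos += 1
        elif byte == ord("&"):              # put the top item onto the list
            head, tail = stack.pop(), stack.pop()
            stack.append([head] + tail)
            pos += 1
        elif byte == ord("$"):              # end of object
            if len(stack) != 1:
                raise ValueError("expected exactly one item on the stack")
            return stack.pop()
        else:
            raise ValueError(f"unexpected byte {byte} at offset {pos}")
    raise ValueError("input ended before '$'")


example = decode(b"{'person' \"Joe\" -42, # 3 & 2 & 1 &}$")
```

In this sketch, example is a tuple holding the constant person, the string Joe, the integer -42 and the list 1, 2, 3. Note that the binary branch reads exactly the announced number of bytes, so a tilde inside the data needs no quoting.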
Caching optimizations
So far we have used exactly 26 control characters, namely:
%"~'`{}#&$\s\n\t\r,-0123456789
This leaves us with 230 unallocated byte codes. These are used as follows:
The byte code sequence
>C
where C is not one of the reserved byte codes or >,
means "store the top of the recognition stack in the register
reg[C] and pop the recognition stack".
Subsequent reuse of the single character C means
"push reg[C] onto the recognition stack".
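The register mechanism can be sketched with a cut-down stack machine. To keep the example short it understands only non-negative integers, lists, white space, $ and the two register operations; the names RESERVED and run are hypothetical, and RESERVED is this sketch's reading of the reserved set (the characters used by the syntax described earlier, with a blank standing for \s).

```python
RESERVED = set(b"%\"~'`{}#&$ \n\t\r,-0123456789")
DIGITS = b"0123456789"


def run(data):
    """Evaluate a small UBF(A) subset that includes register caching."""
    stack, reg, pos = [], {}, 0
    while pos < len(data):
        byte = data[pos]
        if byte in b" \n\t\r,":
            pos += 1
        elif byte in DIGITS:
            end = pos
            while end < len(data) and data[end] in DIGITS:
                end += 1
            stack.append(int(data[pos:end].decode("ascii")))
            pos = end
        elif byte == ord("#"):
            stack.append([])
            pos += 1
        elif byte == ord("&"):
            head, tail = stack.pop(), stack.pop()
            stack.append([head] + tail)
            pos += 1
        elif byte == ord(">"):
            name = data[pos + 1]
            if name in RESERVED or name == ord(">"):
                raise ValueError("invalid register name")
            reg[name] = stack.pop()     # store the top of the stack and pop it
            pos += 2
        elif byte == ord("$"):
            if len(stack) != 1:
                raise ValueError("expected exactly one item on the stack")
            return stack.pop()
        elif byte in reg:               # a previously stored register
            stack.append(reg[byte])     # push reg[C] back onto the stack
            pos += 1
        else:
            raise ValueError(f"unexpected byte {byte} at offset {pos}")
    raise ValueError("input ended before '$'")


cached = run(b"123456 >A # A & A & A & $")
```

Here 123456 is sent once and stored in reg[A]; each later A costs a single byte, so cached is a list of three copies of 123456.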