UBF(A) - quick summary

UBF(A) is the transport format. It was designed to be easy to parse and easy to write with a text editor. UBF(A) is based on a byte-encoded virtual machine in which 26 byte codes are reserved. Instead of allocating the byte codes from 0, we use the printable character codes to make the format easy to read.

Primitive types

UBF(A) has four primitive types. When a primitive is recognized it is pushed onto the "recognition stack" in our decoder. The primitive types are:

Integers are sequences of bytes which could be described by the regular expression -?[0-9]+, that is, an optional minus (to denote a negative integer) followed by a sequence of at least one digit. No restrictions are made as to the precision of the integer; precision issues will be dealt with in UBF(B).
Strings are written enclosed in double quotes, thus:
    "...."

Within a string two quoting conventions are observed: " must be written \" and \ must be written \\ (this is so we can write a double quote within a string). No other quotings are allowed.

Binary Data
Uninterpreted blocks of binary data are encoded, thus:
    Int ~....~

First comes an integer giving the length of the binary data, followed by a tilde, then the data itself, which must be exactly the length given by the integer, and then a closing tilde. The closing tilde has no significance and is retained for readability. White space can be added between the integer length and the opening tilde for readability.

Constants are encoded as strings, only using single quotes instead of double quotes.

Constants are commonly found in symbolic languages like Lisp, Prolog or Erlang. In C they would be represented by hashed strings. The essential property of a constant is that two constants can be compared for equality in constant time. They are used for representing symbolic constants.
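The encoding rules for the four primitive types can be sketched in a few lines of Python. This is only an illustrative sketch, not the reference encoder: the function names are hypothetical, strings are assumed to hold one byte per character (latin-1), and constants are assumed to escape the single quote the same way strings escape the double quote.

```python
def encode_int(n: int) -> bytes:
    # An optional minus sign followed by decimal digits.
    return str(n).encode("ascii")


def _quote(data: bytes, delim: bytes) -> bytes:
    # Escape the backslash first, then the delimiter itself.
    return data.replace(b"\\", b"\\\\").replace(delim, b"\\" + delim)


def encode_string(s: str) -> bytes:
    # Assumption: one byte per character (latin-1).
    return b'"' + _quote(s.encode("latin-1"), b'"') + b'"'


def encode_constant(name: str) -> bytes:
    # Assumption: same quoting rules as strings, with ' as the delimiter.
    return b"'" + _quote(name.encode("latin-1"), b"'") + b"'"


def encode_binary(data: bytes) -> bytes:
    # Length, tilde, exactly that many raw bytes, closing tilde.
    return encode_int(len(data)) + b"~" + data + b"~"
```

Binary data is never escaped, because the decoder relies on the length prefix rather than on the closing tilde.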

In addition, any item can be followed by a semantic tag, which is written `...`. Within the tag, the closing quote is quoted as in the string encoding. This tag has no meaning in UBF(A) but might have a meaning in UBF(B). For example:

    12456 ~....~ `jpg`

This represents 12456 bytes of raw data with the semantic tag "jpg". UBF(A) does not know what "jpg" means; the tag is passed on to UBF(B), which might know what it means. Finally, the end application is expected to know what to do with an object of type "jpg"; it might, for example, know that this represents an image. UBF(A) will just encode the tag, UBF(B) will type check the tag, and the application should be able to understand the tag.
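Attaching a tag to an already encoded item might look like the following Python sketch. The function name with_tag is hypothetical, and the separating blank is only there for readability, as in the example above.

```python
def with_tag(encoded_item: bytes, tag: str) -> bytes:
    # Quote the tag body as in strings: backslash first, then the back quote.
    body = tag.encode("latin-1").replace(b"\\", b"\\\\").replace(b"`", b"\\`")
    return encoded_item + b" `" + body + b"`"


# Three bytes of raw data tagged as "jpg".
tagged = with_tag(b"3~abc~", "jpg")
```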

Compound types

Having defined our four simple types, we define two types of "glue" for making compound objects.

Structures are written:
    { Obj1 Obj2 ... Objn }

The byte codes for "{" and "}" are used to delimit a structure. Obj1..Objn are arbitrary UBF(A) objects. The decoder and encoder must map UBF(A) objects onto an appropriate representation in the application programming language (for example structs in C, arrays in Java, tuples in Erlang, etc.).

Structures are used to represent fixed numbers of objects.

Lists are used to represent variable numbers of objects. They are written with the syntax:
    # ObjN & ObjN-1 & ... & Obj2 & Obj1 &

This represents a list of objects: the first object in the list is Obj1, the second Obj2, etc. Note that the objects are presented in reverse order. Lisp programmers will recognize # as an operator that pushes NIL (or end of list) onto the recognition stack and & as an operator that takes the top two items on the recognition stack and replaces them by a list cell.
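To make the reverse ordering concrete, here is a small Python sketch that glues already encoded items into a structure or a list. The function names are hypothetical, and the items are separated by single blanks so that adjacent integers do not run together.

```python
from typing import Sequence


def encode_struct(items: Sequence[bytes]) -> bytes:
    # A fixed number of objects between '{' and '}'.
    return b"{" + b" ".join(items) + b"}"


def encode_list(items: Sequence[bytes]) -> bytes:
    # '#' pushes the empty list; every 'item &' puts item on the front,
    # so the last element has to be written first.
    out = b"#"
    for item in reversed(items):
        out += b" " + item + b" &"
    return out
```

With these definitions, encode_list([b"1", b"2", b"3"]) yields the bytes `# 3 & 2 & 1 &`, and a decoder that conses each item onto the front ends up with the list in its original order.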

Finally we need to know when an object has finished. The operator $ signifies "end of object". When $ is encountered there should be only one item on the recognition stack.
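The recognition stack can be illustrated with a small Python decoder. This is a sketch for a subset of UBF(A) only (integers, strings, structures, lists and $); binary data, constants, semantic tags and comments are left out. The MARK sentinel and the mapping of structures to tuples and lists to Python lists are assumptions made for the illustration.

```python
MARK = object()  # hypothetical sentinel pushed when '{' is seen
DIGITS = "0123456789"


def decode(text: str):
    """Decode one '$'-terminated object from a subset of UBF(A)."""
    stack = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \r\n\t,":                    # white space
            i += 1
        elif c == "-" or c in DIGITS:          # integer
            j = i + 1
            while j < len(text) and text[j] in DIGITS:
                j += 1
            stack.append(int(text[i:j]))
            i = j
        elif c == '"':                         # string
            chars = []
            i += 1
            while text[i] != '"':
                if text[i] == "\\":
                    i += 1                     # keep the escaped character
                chars.append(text[i])
                i += 1
            stack.append("".join(chars))
            i += 1
        elif c == "{":                         # start of structure
            stack.append(MARK)
            i += 1
        elif c == "}":                         # collect fields back to MARK
            fields = []
            while stack[-1] is not MARK:
                fields.append(stack.pop())
            stack.pop()
            stack.append(tuple(reversed(fields)))
            i += 1
        elif c == "#":                         # push the empty list
            stack.append([])
            i += 1
        elif c == "&":                         # replace top two by a list cell
            head = stack.pop()
            tail = stack.pop()
            stack.append([head] + tail)
            i += 1
        elif c == "$":                         # end of object
            if len(stack) != 1:
                raise ValueError("expected exactly one item at '$'")
            return stack[0]
        else:
            raise ValueError(f"byte code {c!r} is outside this sketch")
    raise ValueError("missing '$'")
```

For example, decode('{"abc" 12 # 3 & 2 & 1 &}$') returns the tuple ("abc", 12, [1, 2, 3]), while decode("1 2 $") raises ValueError because two items remain on the stack.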

White space

For convenience, blank, carriage return, line feed, tab and comma are treated as white space. Comments can be included in UBF(A) with the syntax %...%; the usual quoting convention applies.
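A decoder can skip white space and comments before reading each item. The Python sketch below shows one way to do this; the helper name skip_ignorable is hypothetical, and it assumes that "the usual quoting convention" means a backslash inside a comment escapes the character that follows it.

```python
WHITE_SPACE = " \r\n\t,"


def skip_ignorable(text: str, i: int) -> int:
    """Return the index of the first character at or after i that is
    neither white space nor part of a %...% comment."""
    while i < len(text):
        if text[i] in WHITE_SPACE:
            i += 1
        elif text[i] == "%":
            i += 1
            while i < len(text) and text[i] != "%":
                # Assumption: a backslash escapes the next character.
                i += 2 if text[i] == "\\" else 1
            if i >= len(text):
                raise ValueError("unterminated comment")
            i += 1  # step over the closing '%'
        else:
            break
    return i
```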

Caching optimizations

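So far we have used exactly 26 control characters, namely the ten digits 0-9, the minus sign, the five quoting characters " ' ~ ` %, the structure and list operators { } # &, the end-of-object marker $, and the five white space characters (blank, carriage return, line feed, tab and comma).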

This leaves us with 230 unallocated byte codes. These are used as follows. First, the byte code sequence
    >C

where C is not one of the reserved byte codes or >, means "store the top of the recognition stack in the register reg[C] and pop the recognition stack".

Subsequent reuse of the single character C means "push reg[C] onto the recognition stack".
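The register mechanism can be sketched in Python as follows. To stay short, the sketch handles only integers, white space, the store sequence (> followed by a register character) and register reuse. The RESERVED set is an assumption reconstructed from the characters described in the sections above, and the function name run is hypothetical.

```python
# Assumption: the 26 reserved byte codes, collected from the text above.
RESERVED = set("%\"~'`{}#&$-0123456789 \r\n\t,")
DIGITS = "0123456789"


def run(text: str) -> list:
    """Execute integers, '>C' stores and 'C' reuses; return the stack."""
    stack, reg = [], {}
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \r\n\t,":
            i += 1
        elif c == "-" or c in DIGITS:
            j = i + 1
            while j < len(text) and text[j] in DIGITS:
                j += 1
            stack.append(int(text[i:j]))
            i = j
        elif c == ">":
            name = text[i + 1]
            if name in RESERVED or name == ">":
                raise ValueError(f"{name!r} cannot name a register")
            reg[name] = stack.pop()   # store the top item and pop it
            i += 2
        elif c not in RESERVED:
            stack.append(reg[c])      # push the cached item
            i += 1
        else:
            raise ValueError(f"byte code {c!r} is outside this sketch")
    return stack
```

For example, run("1234 >A A A") stores 1234 in reg["A"] and then pushes it twice, so a value that occurs many times in a message only has to be transmitted in full once.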