White paper
2002-03-05
What is the relationship between UBF and XML?

Both UBF and XML have similar goals and similar structure. XML has a supposedly human friendly syntax and is supposed to be self describing. UBF is programmer friendly and is designed to be efficient and easy to implement.
Why is UBF parsing efficient?

UBF(A) terms are actually little programs which the decoder executes. When reconstructing a UBF(A) term the decoder just executes the program. No grammar checking or parsing is actually involved.
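The "little program" idea can be sketched as a loop that reads one character at a time and pushes onto a stack. The following is only a minimal illustration, not the reference decoder: it covers just the forms that appear in this paper (integers, "strings", 'constants', {structs} and the terminating $) and leaves out binaries, lists, semantic tags and caching. The `Const` class, the `decode` function, the backslash escape handling and the treatment of whitespace and commas as separators are assumptions made for the example.

```python
from dataclasses import dataclass

DIGITS = "0123456789"


@dataclass(frozen=True)
class Const:
    """A UBF(A) constant such as 'person' (hypothetical Python representation)."""
    name: str


def decode(text):
    """Execute UBF(A)-style text as a stack program and return the result.

    Subset only: integers, "strings", 'constants', {structs} and the
    terminating $. Binaries, lists, semantic tags and caching are left out.
    """
    stack = []   # objects built so far
    marks = []   # stack depth remembered at each unmatched '{'
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \t\r\n,":
            # Assumed separators: skip them.
            i += 1
        elif c == "-" or c in DIGITS:
            # Integer: consume the digits and push the value.
            j = i + 1
            while j < len(text) and text[j] in DIGITS:
                j += 1
            stack.append(int(text[i:j]))
            i = j
        elif c in "\"'":
            # String or constant: read up to the matching quote.
            chars = []
            j = i + 1
            while j < len(text) and text[j] != c:
                if text[j] == "\\" and j + 1 < len(text):
                    j += 1  # assumed escape: take the next character literally
                chars.append(text[j])
                j += 1
            if j >= len(text):
                raise ValueError("unterminated string or constant")
            value = "".join(chars)
            stack.append(value if c == '"' else Const(value))
            i = j + 1
        elif c == "{":
            # Remember where this struct's fields start.
            marks.append(len(stack))
            i += 1
        elif c == "}":
            # Pop everything pushed since the matching '{' into one tuple.
            if not marks:
                raise ValueError("'}' without matching '{'")
            start = marks.pop()
            fields = tuple(stack[start:])
            del stack[start:]
            stack.append(fields)
            i += 1
        elif c == "$":
            # End of object: exactly one value must remain.
            if marks or len(stack) != 1:
                raise ValueError("program must leave exactly one object")
            return stack[0]
        else:
            raise ValueError("unsupported character: %r" % c)
    raise ValueError("missing terminating $")
```

Each character selects one action directly, so the decoder never builds a parse tree or consults a grammar; a struct is simply whatever was pushed between a `{` and its `}`.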

UBF(A) encoded terms are also much more compact than the equivalent XML terms. Since parsing an input stream involves looking at every character on the input, UBF parsing is intrinsically much more efficient than XML parsing. A UBF(A) encoder can also make intelligent use of the caching optimization to further improve efficiency.
What is the relation between a UBF type and an XML DTD?

A UBF type is very similar to an XML DTD. For example:

    
    +TYPE p() = {person, name(), age()}.
    +TYPE name() = string().
    +TYPE age() = int().
    
This is almost equivalent to the following:
    
    <!ELEMENT p (person, name, age) >
    <!ELEMENT person EMPTY>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT age (#INT)>
    

Note that XML DTDs cannot have types, so #INT is illegal here. Instead we have to use XML schemas:

    
    ... lots
    
What's the difference between an XML encoding and a UBF encoding?

...
Is UBF(A) a high or low level transport format?

Surprisingly UBF(A) is both a high level and a low level transport format.

It is low-level insomuch as it is very simple: there are only four primitive data types (strings, integers, constants and memory buffers), although some might argue that two would suffice (integers and memory buffers), and there are two types of "glue", one for building "structs" and the other for "lists".

It is high-level insomuch as we purposely do not let any implementation details pollute the interface. Thus we talk about "integers" and not "32-bit integers". We leave the interpretation (i.e. the answer to the question "What is an integer?") to the application.

At some point in the future I assume that all computers will "understand" multi-precision integers and that weird things like 32-bit integers will be viewed as atavistic aberrations. If we want our protocols to be "future proof" we had better free ourselves from "bit-ism" when transmitting integers.

    
    God made the integers; all the rest is the work of man.
    

    - Leopold Kronecker

We also deliberately omit any talk of complex data types (like floats) in UBF(A). If, on the other hand, we want to transmit IEEE floats, we should use the semantic tagging facilities of UBF(A) and encode them thus:


    4~ssss~ `ieee754.single`
    8~dddddddd~ `ieee754.double`

The semantic tags `ieee754.single` etc. would be cached by any sensible UBF(A) encoder. It would also be expected that the application "understood" the meaning of the semantic tags.
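As a rough illustration of the tagged-float encoding above, the sketch below packs a float into a length-prefixed memory buffer and appends the semantic tag, then reverses the process. The function names `encode_float` and `decode_float` are hypothetical, and big-endian byte order is an assumption: the text does not say in which order the bytes are sent, so sender and receiver would have to agree on it. Caching of the tags is not shown.

```python
import struct

# Semantic tag -> struct format. Big-endian (">") is an assumption; the
# paper does not specify the byte order of the buffer contents.
FORMATS = {"ieee754.single": ">f", "ieee754.double": ">d"}


def encode_float(value, tag):
    """Return bytes of the form  N~<N raw bytes>~ `tag`  for the given float."""
    payload = struct.pack(FORMATS[tag], value)
    size = str(len(payload)).encode("ascii")
    return size + b"~" + payload + b"~ `" + tag.encode("ascii") + b"`"


def decode_float(data):
    """Read back a float written by encode_float, using the tag to pick the width."""
    size_text, rest = data.split(b"~", 1)
    size = int(size_text)
    # Slice by the declared length: the payload itself may contain '~'.
    payload, tail = rest[:size], rest[size:]
    if not tail.startswith(b"~"):
        raise ValueError("memory buffer is not closed by ~")
    tag = tail[1:].strip().strip(b"`").decode("ascii")
    (value,) = struct.unpack(FORMATS[tag], payload)
    return value
```

The transport only sees a 4-byte or 8-byte buffer; it is the tag that tells the application to interpret those bytes as a float.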
What gets sent out on the wire

Recall our previous definition:

    
    +TYPE p() = {person, name(), age()}.
    +TYPE name() = string().
    +TYPE age() = int().
    
To transmit a "person" we might send:
    
    {'person'"Jane"18}$
    

The equivalent XML might be:

    
    <p><person/><name>Jane</name><age>18</age></p>
    
    

This would take roughly twice as long to parse, since at the very minimum we must inspect each individual character on the input stream.
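The size difference behind this estimate can be checked by counting characters in the two encodings shown above. The helper names `ubf_person` and `xml_person` are made up for the example, and neither helper escapes its input, so the sketch assumes names without quotes or markup characters.

```python
def ubf_person(name, age):
    """Build the UBF(A) text for a person (no escaping; illustration only)."""
    return "{'person'\"%s\"%d}$" % (name, age)


def xml_person(name, age):
    """Build the XML text used in the comparison (no escaping; illustration only)."""
    return "<p><person/><name>%s</name><age>%d</age></p>" % (name, age)


ubf = ubf_person("Jane", 18)
xml = xml_person("Jane", 18)

# Every character has to be inspected at least once, so the length of the
# input is a lower bound on the work done by either parser.
print(len(ubf), len(xml))  # prints: 19 46
```

For this record the XML form is a little over twice as long as the UBF(A) form, which is where the lower bound on the parsing cost comes from.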
Why don't we have a richer set of types?

Because:

  • Applications would disagree as to how these types should be represented.
  • They are largely unnecessary.
  • They would lead to overly complex interfaces.

As an example of the first point consider a UBF(B) contract for some web service - we might write:

    
    +STATE active
       {do, operation()} => bool() & active;
       ...
    

At this level of abstraction we have defined that the result of an operation is a boolean - but we have not said how a boolean is represented.

The concrete representation of a boolean can be left to the type declaration; it might be any one of the following:

    
    +TYPE bool() = true | false.
    +TYPE bool() = 1 | 0.
    +TYPE bool() = "yes" | "no".
    ...
    

The type declaration says HOW a value of a boolean will be encoded in UBF(A) - for the above definitions the UBF(A) encodings of "true" would be:

    
    'true'$
    1$
    "yes"$
    

To keep things nicely abstract, we should not mix representation with type abstraction.
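One way to picture this separation is a table that maps the abstract boolean onto whichever concrete representation the type declaration selects. This is only a sketch: the representation names (`"constant"`, `"integer"`, `"string"`) and the `encode_bool` function are invented for the example and are not part of UBF.

```python
# One entry per candidate type declaration for bool() shown above.
# The keys are hypothetical labels, not UBF syntax.
BOOL_ENCODINGS = {
    "constant": {True: "'true'", False: "'false'"},  # +TYPE bool() = true | false.
    "integer": {True: "1", False: "0"},              # +TYPE bool() = 1 | 0.
    "string": {True: '"yes"', False: '"no"'},        # +TYPE bool() = "yes" | "no".
}


def encode_bool(value, representation):
    """Encode an abstract boolean as UBF(A) text under the chosen representation."""
    return BOOL_ENCODINGS[representation][value] + "$"


for representation in BOOL_ENCODINGS:
    print(encode_bool(True, representation))
```

Code written against the contract only handles `True` and `False`; swapping the type declaration changes the bytes on the wire without touching that code.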

The second point is that in my experience we do not need many different types to make an interface. Most interfaces involve only the interchange of simple constants or structs of strings and constants.

Kronecker would have been satisfied with just the integers - but I think we need to impose a little more structure on the interface. Four basic types and two types of glue seem perfectly adequate.

XML schemas offer a large number of built-in types. But using them would lead to complex interfaces that are difficult to implement.

Rich sets of primitive types and glue in the interface lead to complex and unimplementable systems. The goal in UBF is to keep things as simple as possible. Semantic tagging is used to pass the problem of interpretation of data to the application (where it belongs!).
History

UBF was inspired by a number of different sources:

  • The idea of a "universal format" for high level languages is as old as the hills - Lisp S expressions could have been used for all RFCs (sadly they were not).
  • The type notation in UBF(B) is similar to that suggested by Phil Wadler and Simon Marlow for work on an Erlang type checker.
  • The caching mechanism in UBF(A) comes from the Erlang marshalling/unmarshalling technique.
  • The protocol definition language in UBF(B) is similar to a suggestion of Wadler for typing Erlang processes.
  • The encoding scheme in UBF(A) is inspired by an early implementation of Prolog.
  • Semantic tagging is due to a discussion with Thomas Arts and ?.
  • This scheme has been improved beyond measure by discussion with Thomas Arts, Luke Gorrie, Seif Haridi, Per Brand ...