The visual systems are very advanced but are mainly used in controlled environments where tracking of head movements is feasible. In practice this means that you have to stay within meters of a stationary system or lose the accuracy needed for exact augmentation.
An alternative approach is to use only the video image captured by the user to generate additional information. This means that the system doesn't really know where you are, only what you see. This can be quite sufficient for many applications.
A problem for image visualization in a mobile system is the delay inherent in any radio-based connection. For exact augmentation, the delay from a movement of the head to the visual feedback can be no more than a couple of milliseconds. If the delay is longer, the user will see all generated information drag behind the real scene. The delay in a GSM-Internet connection is today about 700 ms. This can be reduced as the Internet connection is moved closer to the base station (and the GSM protocol), but we are far from the delays required for exact augmentation.
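To make the size of the gap concrete, the following minimal sketch estimates how far an overlay lags behind the real scene for a given delay. The model is an assumption, not something taken from a specific system: the lag is treated as head angular speed multiplied by delay. The head speed of 50 degrees per second and the tolerated error of 0.1 degrees are illustrative values only; the 700 ms figure is the GSM-Internet delay quoted above.

```python
def registration_error_deg(head_speed_deg_per_s: float, delay_ms: float) -> float:
    """Angle the head turns while the overlay waits for its update (assumed linear model)."""
    return head_speed_deg_per_s * delay_ms / 1000.0


def max_delay_ms(head_speed_deg_per_s: float, tolerated_error_deg: float) -> float:
    """Longest delay that keeps the overlay within the tolerated angular error."""
    return tolerated_error_deg * 1000.0 / head_speed_deg_per_s


# Illustrative assumptions, not measured values.
HEAD_SPEED_DEG_PER_S = 50.0   # a moderate head turn
TOLERATED_ERROR_DEG = 0.1     # hypothetical limit for "exact" augmentation
GSM_DELAY_MS = 700.0          # delay quoted in the text

lag = registration_error_deg(HEAD_SPEED_DEG_PER_S, GSM_DELAY_MS)
budget = max_delay_ms(HEAD_SPEED_DEG_PER_S, TOLERATED_ERROR_DEG)
print(f"Overlay lag at {GSM_DELAY_MS:.0f} ms: about {lag:.0f} degrees")
print(f"Delay budget for {TOLERATED_ERROR_DEG} degrees: about {budget:.0f} ms")
```

Under these assumptions the overlay trails the scene by tens of degrees at GSM delays, while the delay budget comes out at a couple of milliseconds, which is the order of magnitude the text asks for.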
The simplest kind of visualization is when the generated information does not need to be in sync with the movement of the head. In this case the wearable system will be much smaller and easier to handle.
Several projects have worked on guided tours of museums.
The bibliography by Barry Arons has a collection of papers on interactive speech.