Evaluating systems is a difficult task, and it becomes even more difficult when the system is adaptive. It is crucial to be able to distinguish the adaptive features of the system from the general usability of the designed tool. This is probably why most studies of adaptive systems are comparisons of the system with and without adaptivity (Meyer, 1994; Boyle and Encarnacion, 1994; Brusilovsky and Pesin, 1995; Kaplan et al., 1993). The problem with those studies is obvious: the non-adaptive system may not have been designed 'optimally' for the task. At least this should be the case, since adaptivity should preferably be an inherent and natural part of a system: when it is taken out, the system is no longer complete. Still, it is very hard to prove that it is actually the adaptivity that makes the system better unless that condition can be compared with one without adaptivity.
An alternative view of studies with users is put forth by Oppermann (1994), who prefers to see them as part of the design cycle. Since adaptivity is a complex machinery, there must be several rounds of studies which aid the designers in getting the adaptivity right. For example, if the adaptive hypermedia system is supposed to provide different kinds of information to users depending on their knowledge, goals or needs, it may be necessary to carry out several studies before the right relevance criterion can be set up between the user's goal and the preferred information content (or information presentation).
Another important issue is what to measure when evaluating the adaptivity. There are rather few studies of adaptive systems in general, and even fewer of adaptive hypermedia. In the studies by Boyle and Encarnacion (1994), Brusilovsky and Pesin (1995) and Kaplan et al. (1993), the main evaluation criterion is task completion time. This should obviously be one important criterion by which some systems should be evaluated. In the case of the PUSH system, though, the goal of the adaptive hypermedia system is to provide the user with the right, most relevant, information and to make sure that users do not get lost on their way to this information. Therefore task completion time can only be one of several measurements.
Boyle and Encarnacion also measured reading comprehension, while Kaplan et al. measured how many nodes the users visited: in their case, the more nodes the users visited the better. Brusilovsky and Pesin, for their part, measured how many times their students revisited the 'concepts' they were attempting to learn.
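To make the measures mentioned above concrete, the following is a minimal sketch of how task completion time, number of nodes visited and number of revisits might be derived from a session log. The log format, the `Visit` record and the `session_measures` function are hypothetical and are not taken from any of the cited studies; in particular, revisits to hypermedia nodes are used here only as a rough stand-in for revisits to 'concepts', and the sample log is invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Visit:
    """One logged page visit (hypothetical log format)."""
    timestamp: float  # seconds since the session started
    node: str         # identifier of the hypermedia node shown


def session_measures(visits, task_end):
    """Compute simple per-session measures from a visit log.

    visits   -- list of Visit records in chronological order
    task_end -- time (in seconds) at which the user finished the task
    """
    if not visits:
        raise ValueError("empty session log")
    counts = Counter(v.node for v in visits)
    return {
        # Time from the first logged visit until the task was finished.
        "completion_time": task_end - visits[0].timestamp,
        # Number of different nodes the user saw.
        "distinct_nodes": len(counts),
        # Visits beyond the first one to any node.
        "revisits": sum(n - 1 for n in counts.values()),
    }


# Invented example session: the user returns to "intro" once.
log = [
    Visit(0.0, "intro"),
    Visit(12.5, "process"),
    Visit(40.0, "intro"),
    Visit(55.0, "object-model"),
]
print(session_measures(log, task_end=90.0))
```

Which of these values counts as 'good' depends on the system: more visited nodes was desirable in the study by Kaplan et al., whereas for a system whose goal is to lead users to the most relevant information, many visits could equally be a sign that users are lost.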
A final difficulty in studying adaptive systems lies in the procedure of the study. Most adaptive systems will be really useful when they are part of the users' work over a longer period: only during that longer period can we see how the users' needs and goals vary in a 'natural' way. Obviously, this may not be feasible in a research project which has to be finished in limited time.