Judging the Severity of Usability Issues on Web Sites: This Doesn't Work
by Dr. Bob Bailey
Usually when a usability test or evaluation is conducted, usability professionals provide a list of potential usability issues for designers to change. These are the issues that are most likely to cause the greatest problems for real-world users. In order to focus their efforts on the most important issues, most designers want this list ranked with the worst usability issues at the top, and the least serious issues at the bottom. The research, however, shows quite clearly that even highly experienced usability professionals seriously disagree when ranking usability issues.
One of the first good studies on this issue was reported in 1998 by Michael Catani and David Biers at the University of Dayton. They had five usability professionals conduct an expert review to identify problems with a user interface prior to conducting a usability test. The reviewers identified several potential problems and rated the severity of each one. In the usability test, 30 participants performed a series of tasks using a prototype system.
The agreement of the five usability professionals in these severity ratings was very low. Of the nine serious problems identified, only about half were confirmed as serious by the usability testing. The correlation between the "judges' average severity rating" and the "number of participants that had a problem" was a low and unreliable -0.12. They concluded that the usability professionals were not able to agree with each other about severity (or agree with the data collected in the usability tests).
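To make the correlation figure concrete, here is a minimal sketch of how such a value can be computed: average each problem's severity ratings across judges, then correlate those averages with the number of test participants who ran into the problem. The problem names, ratings, and counts below are invented for illustration and are not data from the Catani and Biers study; the 1-to-5 rating scale is also an assumption.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (sx * sy)


# Hypothetical data: severity ratings (1 = minor, 5 = severe) from five judges.
judge_ratings = {
    "confusing label": [5, 2, 4, 3, 1],
    "hidden button": [3, 4, 2, 5, 3],
    "slow feedback": [2, 2, 5, 1, 4],
    "unclear error": [4, 5, 3, 2, 2],
}

# Hypothetical data: how many of 30 test participants hit each problem.
participants_affected = {
    "confusing label": 6,
    "hidden button": 20,
    "slow feedback": 12,
    "unclear error": 3,
}

problems = sorted(judge_ratings)
avg_severity = [mean(judge_ratings[p]) for p in problems]
affected = [participants_affected[p] for p in problems]

r = pearson(avg_severity, affected)
print(f"correlation between average severity and participants affected: {r:.2f}")
```

A value near +1 would mean the judges' average ratings track what participants actually struggled with; a value near zero, like the -0.12 reported in the study, means the ratings say almost nothing about it.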
The same year, Niels Jacobsen, Morten Hertzum, and Bonnie John conducted a study in which four participants each spent an hour working through a set of task scenarios. Next, four usability specialists used the test data to detect and describe all the usability problems they thought were in the interface. They then ranked all the problems according to severity. All usability specialists had access to the prototype system and videotapes from the usability test.
The individual judgments of severity were highly personal. More than half of the top-ten "most severe" problems were selected by only one usability specialist, and none of the problems was selected by all four specialists as being in the top-ten "most severe" category. The usability specialists disagreed on many issues, which suggested that a different combination of specialists would report differences in problem severity.
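The kind of overlap count described above can be sketched as follows: collect each specialist's "most severe" list, then count how many distinct problems were picked by exactly one specialist and how many were picked by all of them. The specialist labels and problem IDs are hypothetical, and the function names are illustrative only, not taken from the study.

```python
from collections import Counter


def selection_counts(top_lists):
    """Count how many specialists placed each problem in their top list."""
    counts = Counter()
    for problems in top_lists.values():
        counts.update(set(problems))  # set() guards against duplicate entries
    return counts


def agreement_summary(top_lists):
    """Summarize overlap across the specialists' top lists."""
    counts = selection_counts(top_lists)
    n_specialists = len(top_lists)
    return {
        "distinct_problems": len(counts),
        "chosen_by_one": sum(1 for c in counts.values() if c == 1),
        "chosen_by_all": sum(1 for c in counts.values() if c == n_specialists),
    }


# Hypothetical data: each specialist's three "most severe" problems.
top_lists = {
    "specialist A": ["P1", "P2", "P3"],
    "specialist B": ["P2", "P4", "P5"],
    "specialist C": ["P1", "P6", "P7"],
    "specialist D": ["P2", "P8", "P9"],
}

summary = agreement_summary(top_lists)
print(summary)
```

With real data, a large "chosen_by_one" count relative to "distinct_problems" is the signature of the highly personal judgments the study reports.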
In 2001, Gilbert Cockton and Alan Woolrych had usability professionals attempt to predict usability issues. They found that these usability specialists were unable to reliably predict usability issues that differed either by frequency (high, medium, or low) or impact (severe, nuisance, or minor).
The same year, Hertzum and Jacobsen (2001) showed that when conducting usability evaluations, any one evaluator is unlikely to detect the majority of the "severe" problems that will be detected collectively. In the study, the evaluators tended to perceive the problems they originally detected themselves as more severe than the problems detected by others. They named this consistent, highly predictable lack of agreement among evaluators "the evaluator effect."
The CUE Studies
Rolf Molich's (2005) series of Comparative Usability Evaluations (CUE-1, CUE-2, and CUE-4) have all clearly shown that testers do not agree with other testers, that evaluators do not agree with other evaluators, and that testers do not agree with evaluators. In the last test (CUE-4), for example, all teams were given the same scale for rating the severity of potential usability problems. Even so, there was considerable variation in which usability issues were judged critical, serious, and minor. There was almost no agreement as to which usability issues deserved to be part of the overall top 5 problems.
Over- and Underestimating Severity
Effie Lai-Chong Law and Ebba Thora Hvannberg (2004) had participants identify usability problems using two different types of heuristic evaluation (expert review) methods. The usability test had several people attempt to complete 10 task scenarios. Two usability specialists then combined the final set of usability issues and rated the potential impact of each of the final set of problems as severe, moderate, or minor. The severity ratings from the heuristic evaluations were compared with the actual usability testing results.
Again, the evaluators were able to predict the actual severity level of usability test problems only about half of the time. When their predictions missed, they overestimated severity in 22% of the cases and underestimated it in 78%. Overestimated severity would lead to a waste of effort in fixing problems, whereas underestimated severity could leave the system as non-usable as it was before the evaluation.
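A comparison of this kind can be sketched by mapping the three severity levels to an ordered scale and tallying, for each problem, whether the predicted level matched, exceeded, or fell short of the level observed in testing. The problem names and ratings below are invented for illustration, and the numeric ordering of the levels is an assumption; this is not the scoring procedure used by Law and Hvannberg.

```python
# Assumed ordering of the three severity levels named in the article.
LEVELS = {"minor": 1, "moderate": 2, "severe": 3}


def compare_severity(predicted, observed):
    """Tally matches, overestimates, and underestimates of severity.

    Only problems present in both mappings are compared.
    """
    tally = {"match": 0, "over": 0, "under": 0}
    for problem, predicted_level in predicted.items():
        if problem not in observed:
            continue  # predicted problem never showed up in testing
        diff = LEVELS[predicted_level] - LEVELS[observed[problem]]
        if diff == 0:
            tally["match"] += 1
        elif diff > 0:
            tally["over"] += 1
        else:
            tally["under"] += 1
    return tally


# Hypothetical data: severity predicted in expert review vs. seen in testing.
predicted = {
    "search box": "minor",
    "checkout form": "moderate",
    "login flow": "severe",
    "help page": "minor",
}
observed = {
    "search box": "severe",
    "checkout form": "moderate",
    "login flow": "moderate",
    "help page": "moderate",
}

tally = compare_severity(predicted, observed)
compared = sum(tally.values())
for outcome, count in tally.items():
    print(f"{outcome}: {count} of {compared}")
```

Keeping overestimates and underestimates separate matters because, as noted above, the two errors have different costs: one wastes effort, the other leaves real problems unfixed.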
Most designers would like usability specialists to prioritize design problems that they found either by inspection evaluations (expert review) or usability testing. They would like the list of usability problems ranked by each issue's severity level. The research literature is fairly clear that even highly experienced usability specialists cannot agree on which usability issues will have the greatest impact on usability. This needs to be explained to designers (and some usability specialists) so that they quit expecting something that cannot be fully delivered.
So what should we do? Usability professionals and designers should come to a stronger agreement on what the true usability priorities are by carefully studying usability findings. Both groups should understand that these priorities still may not be the true priorities, and therefore most or all usability issues that have been identified should be addressed by developers.
Catani, M.B. and Biers, D.W. (1998), Usability evaluation and prototype fidelity, Proceedings of the Human Factors and Ergonomics Society.
Cockton, G. and Woolrych, A. (2001), Understanding inspection methods: Lessons from an assessment of heuristic evaluation, Joint Proceedings of HCI 2001 and IHM 2001: People and Computers XV, 171-191.
Molich, R. (2005), www.dialogdesign.dk/cue.html
Dumas, J.S., Molich, R. and Jeffries, R. (2004), Describing usability problems: Are we sending the right message?, Interactions, July-August, 24-29.
Hertzum, M. and Jacobsen, N.E. (2001), The evaluator effect: A chilling fact about usability evaluation methods, International Journal of Human-Computer Interaction, 13(4), 421-443.
Jacobsen, N.E., Hertzum, M. and John, B. (1998), The evaluator effect in usability studies: problem detection and severity judgments, Proceedings of the Human Factors and Ergonomics Society.
Law, E. L-C and Hvannberg, E.T. (2004), Analysis of strategies for improving and estimating the effectiveness of heuristic evaluation, NordiCHI '04, October, 241-250.