Wednesday, June 10, 2009

Inter-Rater or Inter-Observer Reliability

Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you're kind of stuck. It's probably best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to make sure that your raters' judgments aren't drifting.

This type of reliability is assessed by having two or more independent judges score the test. The scores are then compared to determine the consistency of the raters' estimates. One way to test inter-rater reliability is to have each rater assign each test item a score.

Inter-rater or inter-observer reliability is an estimation method that is used when your measurement procedure is applied by people. We are all subject to distractions, tiredness and a whole host of other effects upon our consistency, and if you are using people as, say, observers, you will want to have some estimation of the reliability and consistency of the people doing the observing.

Two major ways in which inter-rater reliability is used are testing
(a) how similarly people categorize items, and
(b) how similarly people score items.

Inter-rater reliability is the best way of assessing reliability when you are using observation, as observer bias creeps in very easily. It does, however, assume you have multiple observers, which is not always the case.

Inter-rater reliability is also known as inter-observer reliability or inter-coder reliability.

There are two major ways to actually estimate inter-rater reliability, depending on the kind of measurement.

• Categorization: observers check off which category each observation falls into. If your measurement consists of categories, you can calculate the percent of agreement between the raters. For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.

• Scoring: observers rate their observations on a scale that is treated as continuous, such as a Likert scale. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.
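Both estimates are simple enough to compute by hand, but a short Python sketch may make them concrete. The function names and the sample ratings here are made up for illustration, and the correlation shown is the ordinary Pearson correlation, which is one common choice rather than the only one.

```python
from math import sqrt


def percent_agreement(ratings_a, ratings_b):
    """Percent (0-100) of observations on which two raters chose the same category."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of observations")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


def pearson_r(scores_a, scores_b):
    """Pearson correlation between two raters' numeric scores."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equally long lists with at least two scores each")
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a rater gives a constant score")
    return cov / sqrt(var_a * var_b)


# Categorization: 100 observations, three categories, raters agree on 86 of them.
# (Hypothetical data built to match the example in the text.)
rater_a = ["A"] * 40 + ["B"] * 35 + ["C"] * 25
rater_b = ["A"] * 40 + ["B"] * 35 + ["C"] * 11 + ["A"] * 14
agreement = percent_agreement(rater_a, rater_b)
print(f"Percent agreement: {agreement:.0f}%")

# Scoring: classroom activity on a 1-to-7 scale, one rating per time interval.
# (Hypothetical ratings.)
activity_a = [3, 4, 5, 6, 4, 2, 5, 7]
activity_b = [3, 5, 5, 6, 3, 2, 4, 7]
r = pearson_r(activity_a, activity_b)
print(f"Correlation between raters: {r:.2f}")
```

Note that percent agreement only counts exact matches, while the correlation rewards raters who move up and down together even if one of them scores consistently higher than the other.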

You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would go over all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a "3" or a "4" for a rating on a specific item. Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.

Two people may be asked to categorize pictures of animals as being dogs or cats. A perfectly reliable result would be that they both classify the same pictures in the same way.

Observers being used in assessing prisoner stress are asked to assess several 'dummy' people who are briefed to respond in a programmed and consistent way. How far each observer's results vary from this standard gives a measure of their reliability.
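One way to put a number on "variation from a standard" is the average absolute difference between an observer's scores and the scripted ones. This is only a sketch: the scores, the observer names and the choice of mean absolute deviation are assumptions made for illustration, not part of the example above.

```python
def mean_abs_deviation(observer_scores, standard_scores):
    """Average absolute gap between an observer's scores and the standard scores."""
    if len(observer_scores) != len(standard_scores) or not observer_scores:
        raise ValueError("observer and standard must cover the same non-empty set of cases")
    gaps = [abs(o - s) for o, s in zip(observer_scores, standard_scores)]
    return sum(gaps) / len(gaps)


# Hypothetical scripted stress levels for four 'dummy' people.
standard = [2, 5, 3, 4]

# Hypothetical scores from two observers assessing the same four people.
observers = {
    "observer_1": [2, 4, 3, 5],
    "observer_2": [2, 5, 3, 4],
}

deviations = {
    name: mean_abs_deviation(scores, standard)
    for name, scores in observers.items()
}
for name, deviation in deviations.items():
    print(f"{name}: mean absolute deviation from standard = {deviation:.2f}")
```

A smaller value means the observer stays closer to the standard. A value of zero means the observer reproduced the scripted scores exactly.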

In a test scenario, an IQ test applied to several people with a true score of 120 should result in a score of 120 for everyone. In practice, there will usually be some variation between people.

