Inter-rater reliability measures assess the degree to which different raters (or evaluators) agree when evaluating the same set of objects or instances. They help ensure consistency and reliability in research or assessment processes. The table below compares the most common measures.
Measure | Pros | Cons |
---|---|---|
Percent Agreement | – Easy to calculate and understand – Any number of raters and categories | – Doesn’t account for chance agreement – Can lead to inflated reliability estimates |
Brennan and Prediger | – Corrects for chance agreement – Any number of raters and categories | – Requires more computation |
Cohen’s Kappa | – Corrects for chance agreement – Widely used and accepted | – Limited to two raters – Affected by category prevalence and the raters’ marginal distributions (bias) |
Conger’s Kappa | – Corrects for chance agreement – Extends Cohen’s Kappa to any number of raters | |
Scott’s Pi | – Corrects for chance agreement – Less sensitive to imbalanced data | – Assumes equal rater distribution – Limited to two raters |
Fleiss’ Kappa | – Corrects for chance agreement – Extends Scott’s Pi to any number of raters | – Assumes equal rater distribution |
Gwet’s AC | – Corrects for chance agreement – Addresses some limitations of Cohen’s Kappa and other similar statistics – Less sensitive to category prevalence (the proportion of instances belonging to a specific category) and data distribution (the difference in how often raters use certain categories) | |
Krippendorff’s Alpha | – Flexible with raters, measurement levels, and sample sizes – Allows for missing data and varying metric properties – Corrects for chance agreement | |
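Except for percent agreement, the measures above are all chance-corrected and share the same general form; they differ mainly in how the expected chance agreement is defined (Krippendorff’s Alpha is usually written in terms of observed and expected disagreement, but the idea is the same). Using standard notation (mine, not the package documentation's):

\[ \text{coefficient} = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed proportion of agreement and \(p_e\) is the proportion of agreement expected by chance.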
kappaetc calculates various measures of interrater agreement along with their standard errors and confidence intervals. Statistics are calculated for any number of raters, any number of categories, and in the presence of missing values (i.e. varying number of raters per subject). Disagreement among raters may be weighted by user-defined weights or a set of prerecorded weights suitable for any level of measurement.
ssc install kappaetc // install package
The code is really simple! Just list the variables that contain each rater’s ratings.
kappaetc rater1 rater2 rater3 ...
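As a minimal, hedged sketch (the variable names are hypothetical, and the wgt() option values come from my reading of the kappaetc help file, so confirm them with help kappaetc before relying on them):

* three raters scored the same subjects; each variable holds the ratings from one rater
kappaetc rater1 rater2 rater3 // unweighted agreement coefficients
kappaetc rater1 rater2 rater3, wgt(linear) // linear weights for ordinal ratings
kappaetc rater1 rater2 rater3, wgt(quadratic) // quadratic weights penalize larger disagreements more heavily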
The output reports six inter-rater reliability coefficients. As discussed above, the scores can differ across coefficients because each handles chance agreement, category prevalence, and rater bias differently.
You can interpret the results against benchmark cutoff scores. The following is copied from the kappaetc package documentation.
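If you would rather have Stata attach these interpretations directly to the estimates, the kappaetc documentation also describes a benchmark option; as a sketch (the option name is taken from the package help file, so verify it with help kappaetc):

kappaetc rater1 rater2 rater3, benchmark // benchmark each coefficient against published interpretation scales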
Reference
The following materials are provided by the developer of kappaetc; see them for the formulas and background literature behind each method.