Assessing Agreement Among Raters And Identifying Atypical Raters Using A Log-Linear Modeling Approach

Author: Kastango Kari B.
Publication date: 06/06/2006

When an outcome is rated by several raters, ensuring consistency across raters increases the reliability of the measurement. Tanner and Young (1985) proposed a general class of log-linear models to assess agreement among K raters using a rating scale with C nominal categories. Their methodology can be used to assess pair-wise agreement among three or more raters. Rogel et al. (1996, 1998) extended this work by assessing various patterns of agreement among rater sub-groups of size K-1. These models can be used to test the assumption of rater exchangeability. Although parameters from these models can be used to identify atypical raters, no formal inferential procedures are available. I propose a formal inferential approach that can be used to test the assumption of rater exchangeability and to identify an atypical rater. The global and heterogeneous partial agreement model is fit to the data, and pair-wise comparisons of the K partial agreement parameters are made, with the p-values adjusted for multiple comparisons. The heterogeneous partial agreement parameter that is consistently involved in the statistically significant pair-wise comparisons is singled out. The premise is that, if there is an atypical rater, at least one heterogeneous partial agreement parameter will differ from at least one of the remaining K-1 partial agreement parameters. The approach is illustrated using published data from an intestinal biopsy rating study with six raters (Rogel et al., 1998). The overall Type I error and the power of the inferential approach to correctly identify atypical raters are assessed via simulation with rater sub-groups of size 5. The Bonferroni and Sidak procedures, and Holm's step-down procedure with the Bonferroni and Sidak adjustments, are used to control the overall Type I error.
Correctly identifying an atypical rater, if present, and improving the consistency of ratings directly influence the reliability of the measurement and the power of the study for a given sample size. Consequently, more informative studies can be conducted of interventions (e.g., behavioral, medicinal) that may have a significant positive impact on the public's health.
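
The multiplicity adjustments named in the abstract (Bonferroni, Sidak, and Holm's step-down with either adjustment) can be sketched in a few lines of Python. This is only an illustration of the standard textbook procedures, not the author's code: the function names, the `flag_counts` helper, and the rule of counting how often each parameter appears in a significant comparison are assumptions made here to mirror the description above, and fitting the log-linear model that would produce the raw pair-wise p-values is not shown.

```python
from itertools import combinations


def bonferroni(pvals):
    """Single-step Bonferroni adjustment: multiply each p-value by m, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def sidak(pvals):
    """Single-step Sidak adjustment: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


def holm(pvals, base="bonferroni"):
    """Holm step-down adjustment using a Bonferroni or Sidak base.

    The smallest p-value is adjusted for m comparisons, the next for m - 1,
    and so on; a running maximum keeps the adjusted values monotone.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        remaining = m - rank
        if base == "bonferroni":
            adj = min(1.0, pvals[idx] * remaining)
        else:  # Sidak base
            adj = 1.0 - (1.0 - pvals[idx]) ** remaining
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted


def flag_counts(k, pair_pvals, alpha=0.05, adjust=holm):
    """Count, per parameter, the significant pair-wise comparisons it is in.

    `pair_pvals` must follow the order of itertools.combinations(range(k), 2).
    Hypothetical helper: the counting rule is an assumed reading of the
    abstract, not a procedure taken from the thesis.
    """
    pairs = list(combinations(range(k), 2))
    if len(pairs) != len(pair_pvals):
        raise ValueError("expected one p-value per pair of parameters")
    counts = [0] * k
    for (a, b), p in zip(pairs, adjust(pair_pvals)):
        if p < alpha:
            counts[a] += 1
            counts[b] += 1
    return counts
```

Under this reading, with K partial agreement parameters there are K(K-1)/2 pair-wise comparisons, and a parameter whose count reaches K-1 is involved in every significant comparison it could be, which would make the corresponding rater sub-group the candidate for containing (or excluding) the atypical rater.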

D-Scholarship@Pitt