A crossed random effects modeling approach for estimating diagnostic accuracy from ordinal ratings without a gold standard

Yunlong Xie; Zhen Chen; Paul S Albert

doi:10.1002/sim.5784

A crossed random effects modeling approach for estimating diagnostic accuracy from ordinal ratings without a gold standard

Yunlong Xie, Zhen Chen, Paul S Albert

Statistics in medicine · 2013 · doi:10.1002/sim.5784 · PMID:23529923 · PMC3740052

other OA: green public-domain-us

📄 Open PDF Full text JSON View on PubMed View at publisher

⚙ AI-generated summary by claude@2026-06, 2026-06-13 ⓘ

This paper proposes a crossed random effects model for estimating diagnostic accuracy from ordinal ratings without a gold standard, finding regional experts better classify intermediate endometriosis stages than residents.

One-sentence paraphrase of the abstract; not a substitute for reading it. No clinical advice. How this works

⚙ AI-generated deep summary by claude@2026-06, 2026-06-13 · read from full text ⓘ

The paper develops a latent class model to estimate diagnostic accuracy for ordinal tests when there is no gold-standard true disease status, focusing on relaxing the usual conditional independence assumption by introducing crossed subject- and rater-specific random effects (plus robustness to misspecification via mixtures of normals). Motivated by the Physician Reliability Study, it models 8 physicians’ endometriosis stage ratings (0–4) for 79 subjects and estimates misclassification using a cumulative ordinal model with random effects, using Monte-Carlo EM for maximum likelihood estimation. Key outputs include estimates of stage-specific diagnostic performance and an agreement measure (kappa reported as 0.379), while a stated caveat is that the method’s validity relies on model assumptions such as the random-effects structure and identifiability choices. This paper is centrally about endometriosis — specifically, it applies crossed random effects latent class modeling to quantify diagnostic accuracy of physician endometriosis staging without a gold standard.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice. How this works

Abstract

In diagnostic studies without a gold standard, the assumption on the dependence structure of the multiple tests or raters plays an important role in model performance. In case of binary disease status, both conditional independence and crossed random effects structure have been proposed and their performance investigated. Less attention has been paid to the situation where the true disease status is ordinal. In this paper, we propose crossed subject-specific and rater-specific random effects to account for the dependence structure and assess the robustness of the proposed model to misspecification in the random effects distributions. We applied the models to data from the Physician Reliability Study, which focuses on assessing the diagnostic accuracy in a population of raters for the staging of endometriosis, a gynecological disorder in women. Using this new methodology, we estimate the probability of a correct classification and show that regional experts can more easily classify the intermediate stage than resident physicians.

Full text 33,885 characters · extracted from pmc-nxml · 6 sections · click to expand

Section 3

In this section, we introduce the method for obtaining the MLEs of the model parameters. Due to the high dimensional integration and summation, it can be challenging to use the classical EM algorithm to estimate ( 4 ). Instead, we follow Wei and Tanner [ 10 ] in using the Monte-Carlo EM (MCEM) algorithm. The MCEM incorporates a Monte Carlo implementation into the E-step of EM algorithm so that the required expectation can be approximated by the average of the Monte-Carlo samples from the target distribution. By using the Metropolis-Hasting algorithm to draw samples from target distributions, McCulloch [ 11 ] developed maximum likelihood algorithms for generalized linear mixed models. Robert and Casella [ 12 ] gave examples of R implementations using the MCEM algorithm. Here, we treat the true disease status { D i } i = 1 I , subject-specific random effects { b i } i = 1 I and rater-specific { c j } j = 1 J as missing data. For notational brevity, let Y ∗ = ( Y 1 ′ , … , Y I ′ ) be the observed data, Z ∗ = { { D i } i = 1 I , { b i } i = 1 I , { c j } j = 1 J } ′ be the missing data and W * = ( Y *, Z *)’ be the complete data. Accordingly, the complete-data likelihood is (6) log f ( w ∗ ∣ θ ) = ∑ i = 1 I ∑ j = 1 J ∑ k = 0 K I ( Y i j = k ) log P ( Y i j = k ∣ D i = d i , b i , c j , X i j ) + ∑ i = 1 I log [ g b ( b i ) ] + ∑ j = 1 J log [ g c ( c j ) ] + ∑ i = 1 I log π d i + constant , For our purpose, we can obtain the kernel of the target distribution by Bayes Rule: f ( z ∗ ∣ y ∗ , θ ) = f ( y ∗ ∣ z ∗ , θ ) f ( z ∗ ∣ θ ) f ( y ∗ ∣ θ ) ∝ f ( y ∗ ∣ z ∗ , θ ) f ( z ∗ ∣ θ ) ∝ ∏ i = 1 I ∏ j = 1 J ∏ k = 0 K [ P ( Y i j = k ∣ d i , b i , c i , X i j ) I ( Y i j = k ) ] × ∏ i = 1 I [ g b ( b i ∣ θ ) ] × ∏ j = 1 J [ g c ( c j ∣ θ ) ] × π 1 I 1 × π 2 I 2 × ⋯ × π D I D × ( 1 − ∑ d = 1 D π d ) ( 1 − Σ d − 1 D I d ) , where I d = ∑ i = 1 I I ( D i = d ) for d = 0, 1, …, D . We implement the MCEM algorithm as follows: Start by initial values θ (0) = ( γ (0) , σ (0) , τ (0) , π (0) , β (0) )’ and set r = 0. Generate M groups of values z * (1) , z * (2) , …, z * ( M ) from the target distribution f ( z *| y *, θ ( r ) ) and choose θ ( r +1) to maximize 1 M ∑ m = 1 M log [ f ( w ∗ ( m ) ∣ θ ) ] f ( z ∗ ( m ) ∣ y ∗ , θ ( r ) ) , which is the Monte-Carlo estimate of Q ( θ ∣ θ ( r ) ) = E [ log f ( W ∗ ∣ θ ) ∣ y ∗ , θ ( r ) ] = ∫ log f ( w ∗ ∣ θ ) f ( z ∗ ∣ y ∗ , θ ( r ) ) d z ∗ , where θ ( r ) is the parameter value from the r th iteration and set r = r + 1. The maximization is subject to the restrictions that elements in each column of matrix γ are nondecreasing from the top to the bottom, all the elements in σ, τ and π are nonnegative and that all the elements in π sum up to one. R function nloptr can be employed for such constrained maximization. We generate sample z * ( m ) ( m = 1, …, M ) from the target distribution by Metropolis-Hasting algorithm: Let f ~ ( r ) ( d , b , c ∣ w ∗ , θ ( r ) ) ≡ ∏ i = 1 I f ~ D ( r ) ( d i ) f ~ b ( r ) ( b i ) ∏ j = 1 J f ~ c ( r ) ( c j ) denote the candidate distribution with f ~ D ( r ) ( d i ) being the empirical distribution of d i and f ~ b ( r ) ( b i ) and f ~ c ( r ) ( c j ) being t distribution centered at the means of b i and c j sampled from the r th iteration of the EM algorithm respectively. Choose z 0 from the support of the target distribution. For n = 1, 2, …: generate U n ~ uniform(0, 1) and v n ∼ f ~ ( r ) ( d , b , c ∣ y ∗ , z n − 1 , θ ( r ) ) . Set z n = { v n if U n ≤ ω z n − 1 if U n > ω } , where ω = min ( f ( v n ∣ y ∗ , θ ( r ) ) f ~ ( r ) ( z n − 1 ∣ y ∗ , z n − 1 , θ ( r ) ) f ( z n − 1 ∣ y ∗ , θ ( r ) ) f ~ ( r ) ( v n ∣ y ∗ , z n − 1 , θ ( r ) ) , 1 ) . As n → ∞, z n converges to z * ~ f ( z *| y *, θ ) in distribution. In order to generate random effects b i ’s and c j ’s from the candidate distribution, we choose 30 as the degrees of freedom of the t distributions mentioned in part (a) so that they do not have heavy tails. We repeat Step 2 a sufficient number of iterations until the estimates become stable. Start by initial values θ (0) = ( γ (0) , σ (0) , τ (0) , π (0) , β (0) )’ and set r = 0. Generate M groups of values z * (1) , z * (2) , …, z * ( M ) from the target distribution f ( z *| y *, θ ( r ) ) and choose θ ( r +1) to maximize 1 M ∑ m = 1 M log [ f ( w ∗ ( m ) ∣ θ ) ] f ( z ∗ ( m ) ∣ y ∗ , θ ( r ) ) , which is the Monte-Carlo estimate of Q ( θ ∣ θ ( r ) ) = E [ log f ( W ∗ ∣ θ ) ∣ y ∗ , θ ( r ) ] = ∫ log f ( w ∗ ∣ θ ) f ( z ∗ ∣ y ∗ , θ ( r ) ) d z ∗ , where θ ( r ) is the parameter value from the r th iteration and set r = r + 1. The maximization is subject to the restrictions that elements in each column of matrix γ are nondecreasing from the top to the bottom, all the elements in σ, τ and π are nonnegative and that all the elements in π sum up to one. R function nloptr can be employed for such constrained maximization. We generate sample z * ( m ) ( m = 1, …, M ) from the target distribution by Metropolis-Hasting algorithm: Let f ~ ( r ) ( d , b , c ∣ w ∗ , θ ( r ) ) ≡ ∏ i = 1 I f ~ D ( r ) ( d i ) f ~ b ( r ) ( b i ) ∏ j = 1 J f ~ c ( r ) ( c j ) denote the candidate distribution with f ~ D ( r ) ( d i ) being the empirical distribution of d i and f ~ b ( r ) ( b i ) and f ~ c ( r ) ( c j ) being t distribution centered at the means of b i and c j sampled from the r th iteration of the EM algorithm respectively. Choose z 0 from the support of the target distribution. For n = 1, 2, …: generate U n ~ uniform(0, 1) and v n ∼ f ~ ( r ) ( d , b , c ∣ y ∗ , z n − 1 , θ ( r ) ) . Set z n = { v n if U n ≤ ω z n − 1 if U n > ω } , where ω = min ( f ( v n ∣ y ∗ , θ ( r ) ) f ~ ( r ) ( z n − 1 ∣ y ∗ , z n − 1 , θ ( r ) ) f ( z n − 1 ∣ y ∗ , θ ( r ) ) f ~ ( r ) ( v n ∣ y ∗ , z n − 1 , θ ( r ) ) , 1 ) . As n → ∞, z n converges to z * ~ f ( z *| y *, θ ) in distribution. In order to generate random effects b i ’s and c j ’s from the candidate distribution, we choose 30 as the degrees of freedom of the t distributions mentioned in part (a) so that they do not have heavy tails. We repeat Step 2 a sufficient number of iterations until the estimates become stable. Let f ~ ( r ) ( d , b , c ∣ w ∗ , θ ( r ) ) ≡ ∏ i = 1 I f ~ D ( r ) ( d i ) f ~ b ( r ) ( b i ) ∏ j = 1 J f ~ c ( r ) ( c j ) denote the candidate distribution with f ~ D ( r ) ( d i ) being the empirical distribution of d i and f ~ b ( r ) ( b i ) and f ~ c ( r ) ( c j ) being t distribution centered at the means of b i and c j sampled from the r th iteration of the EM algorithm respectively. Choose z 0 from the support of the target distribution. For n = 1, 2, …: generate U n ~ uniform(0, 1) and v n ∼ f ~ ( r ) ( d , b , c ∣ y ∗ , z n − 1 , θ ( r ) ) . Set z n = { v n if U n ≤ ω z n − 1 if U n > ω } , where ω = min ( f ( v n ∣ y ∗ , θ ( r ) ) f ~ ( r ) ( z n − 1 ∣ y ∗ , z n − 1 , θ ( r ) ) f ( z n − 1 ∣ y ∗ , θ ( r ) ) f ~ ( r ) ( v n ∣ y ∗ , z n − 1 , θ ( r ) ) , 1 ) . As n → ∞, z n converges to z * ~ f ( z *| y *, θ ) in distribution. In order to generate random effects b i ’s and c j ’s from the candidate distribution, we choose 30 as the degrees of freedom of the t distributions mentioned in part (a) so that they do not have heavy tails. In the simulation studies and the analysis of the PRS data presented in Sections 4 and 5, we run the algorithms 200 steps. We take M = 50 for iterations r = 1, …, 50, M = 200 for iterations r = 51, …, 80 and M = 1000 for iterations r = 81, …, 200. Final estimates are computed by averaging θ ^ ( r ) ( r = 191 , … , 200 ) over the last ten iterations. We wish to compare different random effects distributions for { b i }, { c j } in ( 2 ) using penalized likelihood method such as the Akaike information criterion (AIC). However, the likelihood is difficult to evaluate given the high dimension of the random effects. We use the simulated likelihood approach by Geyer and Thompson [ 13 ] to overcome this difficulty. Specifically, let θ ^ = ( γ ^ , σ ^ , τ ^ , π ^ , β ^ ) ′ be the MLE obtained from the MCEM algorithm and in ( 4 ) consider L ( θ ∣ y ) = E [ ∏ i = 1 I ∏ j = 1 J ∏ k = 0 K P ( Y i j = k ∣ D i = d i , b i , c j , X i j ) I ( Y i j = k ) ] , where the expectation is taken over all random effects { b i }, { c j } and { d i }. The maximized likelihood L ( θ ^ ∣ y ) can be approximated by L ^ ( θ ^ ∣ y ) = 1 T ∑ t = 1 T [ ∏ i = 1 I ∏ j = 1 J ∏ k = 0 K P ^ ( Y i j = k ∣ D i = d i ( t ) , b i ( t ) , c j ( t ) , X i j ) I ( y i j = k ) ] × ∏ i = 1 I π ^ d i ( t ) g b ( b i ( t ) ) ∏ j = 1 J g c ( c j ( t ) ) , where b i ( t ) , c j ( t ) and d i ( t ) are the t -th simulated realizations from the density functions g b for the i -th subject, g c for the j -th rater and mass functions multinomial ( 1 , ( π ^ 0 , … , π ^ D ) ) for the i -th subject respectively, T is the total number of such simulated realizations and P ^ ( Y i j = k ∣ D i = d i ( t ) , b i ( t ) , c j ( t ) , X i j ) = Φ ( γ ^ d i ( t ) , k + σ ^ d i ( t ) b i ( t ) + τ ^ d i ( t ) c j ( t ) + X i j β d i ( t ) ) − Φ ( γ ^ d i ( t ) , k − 1 + σ ^ d i ( t ) b i ( t ) + τ ^ d i ( t ) c j ( t ) + X i j β d i ( t ) ) by ( 3 ). The Monte-Carlo estimate of the Akaike Information Criteria A I C ^ ≡ 2 × n p a r − 2 log L ^ ( θ ^ ∣ y ) can be used to compare different models for the random effects, where n par is the number of parameters in the model. In the simulation studies and the analysis of the PRS data, we take T = 5000 Proving global identifiability for complex latent class models such as the one we are proposing is a difficult and challenging problem. In fact, at the very least, these models lack identifiability due to a label-switching problem. Practitioners need to be cautious when applying these methods to be sure that results are scientifically sensible. Various authors have worked on the issue of local identifiability (e.g., [ 14 , 15 ]). This latter form of identifiability can be proved by showing that the Hessian matrix is non-singular. Unfortunately this is difficult for our model since the Hessian matrix can only be estimated by using Monte-Carlo method and its estimate can be unstable when the number of parameters is large.

Section 4

In this section, we perform simulation studies to assess the convergence and operating characteristics of the MCEM algorithm. For simplicity, we focus on the case of a homogeneous group of physicians (i.e., no covariates in the model). For notational brevity, we therefore write P ( Y ~ = D ~ ) in place of P ( Y ~ = D ~ ∣ X = x ) . Data were simulated according to model ( 1 ) with random effects following distribution ( 2 ). All simulations were performed with I = 100 subjects and J = 10 raters. The true prevalences are 0.3, 0.25, 0.2, 0.15, 0.1 for no endometriosis, stage I to IV respectively, and the true misclassification matrix is as in Table 2 . These values are chosen to be close to the parameter estimates in the PRS shown in Section 5. Based on the parameters in Table 2 , the true probability of a correct classification is P ( Y ~ = D ~ ) = 0.454 . To assess the convergence of the MCEM algorithm and the Monte-Carlo variations, we simulated data sets based on random effects with normal and MixNs respectively. We obtained the MLEs of the parameters by assuming that the random effects follow the true corresponding distributions. The algorithm was performed based on two different seeds and three different starting values of the probability of a correct classification: 0.8, 0.2 and 0.5, for larger than, smaller than and close to the true value respectively. We show in Figure 1 that the convergence of the estimated probability of a correct classification P ^ ( Y ~ = D ~ ) is insensitive to the choice of either the starting values or the seeds of the MCEM algorithm. This empirically suggests that the model is locally identifiable. The convergence of each element in θ is similar to that of P ^ ( Y ~ = D ~ ) and is not presented here to save space. Specifically, the algorithm converges at around the 80 th iteration and appears stable thereafter with tolerable Monte-Carlo variations. This is probably because we increased the size of Monte-Carlo samples from the target distribution in the E-step from 200 to 1000 at the 80 th iteration. Similar results are observed regardless of whether the distributions of random effects are normal or MixNs. We performed similar simulations using different parameter values and observed the same results (data not shown). For the algorithm to reach the approximate convergence (within Monte-Carlo error), it took approximately 16 hours on an NIH linux cluster with a 2.8 GHz Intel X5660 processor. To assess the operating characteristics of the MCEM algorithm, we simulated different data sets based on different random effects distributions. More specifically, in ( 2 ), we choose λ = λ b = λ c = 0, 0.25 or 0.5 with the corresponding distributions shown in Figure 2 . For each λ , we generated 200 data sets and obtained the MLEs of model parameters using each individual data set. In Table 3 , we summarize the estimated prevalence, with the standard error in parentheses. When λ = 0, the data are simulated according to normally distributed subject-specific and rater-specific random effects. When a normal distribution is assumed as the working model, the estimates are mostly unbiased. This is also true when MixNs are assumed since normal random effects are special cases of MixN random effects. In contrast, when λ ≠ 0, the data are simulated according to a model with the subject and rater-specific random effects following MixNs. When MixNs are assumed as the working model, the estimates are nearly unbiased. However, when normal distributions are assumed, most estimates are biased. Moreover, the biases are more profound when λ = 0.5 where the misspecification of the random effects distribution is more severe than when λ = 0.25 (see Figure 2 ). The estimated misclassification matrix in Table 4 shows a similar pattern. For example, when λ = 0, the estimates of P ( Y = 4 |D = 4) are nearly unbiased under both random effects distributions, 0.262 and 0.271 for normal and MixNs respectively compared to the true value of 0.25; when λ ≠ 0 and MixNs are assumed as the working model, the estimates are mostly unbiased: 0.237 and 0.231 for λ = 0.25 and 0.5 respectively. When normal distributions are assumed, the estimates are biased: 0.326 and 0.381 for λ = 0.25 and 0.5 respectively. Clearly the bias is larger (0.381) when λ = 0.5 than that (0.326) when λ = 0.25. When normal distributions are assumed for the random effects in the true model and either normal or MixNs in the working model, the estimated probabilities of a correct classification P ^ ( Y ~ = D ~ ) (standard errors) are 0.466(0.087) and 0.472(0.127) respectively, and both are close to the true probability of a correct classification, 0.454. However, when λ = 0.25 or 0.5, namely the true distributions of the random effects are MixNs, fitting a working model with normal random effects distribution resulted in a biased probability of a correct classification: 0.548(0.095) and 0.602(0.105), respectively (compared with a true value of 0.454). In comparison, the corresponding estimates under a correctly specified working models are 0.466(0.139) and 0.464(0.146) respectively. Since it is important to specify the distribution of random effects correctly, we are interested in the empirical rates that the penalized likelihood criteria select the true model. By simulation, Zhang et al [ 16 ] showed that when the true disease status and outcomes from raters are both binary, penalized likelihood criteria empirically select the true model at the rate of 55%, which is slightly higher than 50%. Here, when the true disease status and outcomes are both ordinal, the empirical rate that penalized likelihood criteria select the true model is 74%, 73% and 76% for λ = 0, 0.25 and 0.5 respectively. Thus, compared to the method of dichotomizing ordinal data, the proposed method is empirically more likely to select the true model and estimate the parameters accurately. This is likely due to the increased information contained in the ordinal data.

Section 5

We now apply the proposed modeling framework to data from the Physician Reliability Study [ 9 ]. Eight physicians (4 residents and 4 regional experts) reviewed digital images on 79 subjects. Scientific substantive questions include: (1) how did these physicians stage endometriosis? Were they more likely on target or within one stage? (2) are the physicians better at diagnosing some particular stages as compared to others? (3) are the findings in questions (1) and (2) different for the 4 residents as compared with the 4 regional experts? With respect to the 8 physicians, we first ignore their difference, treating them as a homogeneous group. Later we will account for their difference (residents vs regional experts) in the modeling framework. Among all the four possible combinations of subject and rater-specific random effects the AIC is the smallest (3219.47) for the model with normal subject-specific random effects and MixN rater-specific random effects. Under the best model, the prevalences of the five stages are estimated at 0.321(0.11), 0.238(0.08), 0.211(0.03), 0.151(0.06) and 0.079(0.06) respectively, and the estimated misclassification matrix (with the bootstrap standard errors based on 1000 bootstrapped samples in parentheses) is presented in Table 5 . The estimated probability of a correct classification is P ^ ( Y = D ) = 0.515 . In Table 5 , the diagonal elements represent the average diagnostic accuracy across the population of the physicians for each true endometriosis stage. For example, when the true stage of endometriosis is mild ( d = 2), the average probability that the physicians would correctly stage the disease is 0.289. In each row, the elements adjacent to the diagonal elements are important since they represent the situation when the physicians categorize the diagnostic result only one stage off from the truth. In the PRS, the physicians are more likely to underestimate the severity by one stage than overestimate by one stage. Moreover, it appears that these physicians are better at diagnosing lower stages of endometriosis (no endometriosis and mild stage) than higher ones, with the correction classification of 75%, 63%, 29%, 26% and 29% for the five stages (from low to high), respectively. Since there are two physician groups (residents and regional experts) serving as raters, it is of interest to see whether the above conclusions hold when we account for the difference between the two groups of physicians with different levels of experience. In doing so, we utilize a modified version of model ( 1 ): P ( Y i j ≤ k ∣ d i , b i , c j , X i j ) ≡ Φ ( γ d i , k + σ d i b i + τ d i c j + β d i I ( j is resident ) ) , where I (•) is an indicator function. Among all the four combinations of random effects distributions, the model assuming normal subject-specific random effects and normal rater-specific random effects had the smallest AIC (3211.36), with the probability of a correct classification for resident and regional expert being 0.463 and 0.489 respectively. The estimated prevalences and coefficients for the group indicator as well as the misclassification matrices for both groups (with the bootstrap standard errors based on 1000 bootstrapped samples in parentheses) are listed in Tables 6 and 7 respectively. By including the group indicator as covariate, we have smaller AIC than when the two groups are treated as homogeneous, suggesting that the resident and regional expert group might be different at staging endometriosis. More specifically, we can see from Table 7 that when the true disease stage is mild ( d = 2), the probability that regional experts make the right judgement is 0.342, which is significantly higher than 0.151, the probability that residents make the right judgement. This difference in the physician type may also explain why a MixNs distribution is needed for the rater-specific random effects when the 8 physicians are treated homogeneously. It appears that failure to include an important binary rater-specific covariate will induce a bimodal structure in the random effects distribution.

Intro

In public health and medical research, it is of importance to study the diagnostic accuracy of new tests or raters (e.g., [ 3 , 4 ]). This is usually done by comparing the diagnostic results from the tests or raters with the true disease status (i.e., the gold standard). In many cases, however, a gold standard may not be measured due to cost constraints, concerns of the invasive nature of the diagnostic procedure, or a lack of biotechnology to obtain a definitive result. There has been extensive literature on estimating diagnostic accuracy without a gold standard, with the latent class approach (LCA) widely being used (e.g., [ 4 , 5 , 6 ]). It is usually assumed in LCA that the multiple tests or raters are independent conditional on the true disease status. However, the conditional independence assumption may not be valid in practice and models with such an assumption may not fit the data well. Alternatively, Qu et al [ 7 ] developed a LCA by including normally distributed subject-specific random effects to model conditional dependence among binary tests. Albert and Dodd [ 8 ] demonstrated that, when the unknown disease status is binary, the model is weakly identified in the random effects distribution in the sense that different random effects distributions may fit the data equally well. LCA can also be utilized when the true outcome of interest is ordinal rather than binary. Wang et al [ 1 ] extended the work of Zhou et al [ 6 ] on binary outcome to ordered multiple symptom categories and applied it to data from traditional Chinese medicine. In a further extension, Wang and Zhou [ 2 ] incorporated normal subject-specific random effects while assuming fixed effects for the raters. In this paper, we are interested in relaxing the conditional independence assumptionin Wang and Zhou [ 1 ] by proposing crossed subject- and rater-specific random effects to account for the dependence structure in the data. We are interested in assessing the robustness of the proposed models to misspecifications in the Gaussian random effects by considering a mixture of normals for both subject- and rater-specific random effects. This article is motivated by the Physician Reliability Study (PRS) [ 9 ] that investigated the reliability of endometriosis between different physicians and settings. In the PRS, 12 physicians in obstetric and gynecology (OB/GYN) separately reviewed participant clinical information (digital intra-uterus image taken during laparoscopy, surgeon notes, MRI and histopathology reports) and assessed the endometriosis staging. Each physician conducted the review in a sequence of four settings, with each successive setting having an additional piece of clinical information to the reviewing physicians. In this article, we evaluate the diagnostic accuracy of 8 physicians (4 regional experts and 4 residents) who are practicing at the same medical center (Utah) when each of them reviewed the digital images (setting 1). Our interest here is evaluating the diagnostic accuracy in the population of these physicians; hence we treat physicians as a random rather than a fixed effect. Endometriosis is a gynecological disorder in women that occurs when cells from the lining of the uterus grow in other area of the uterus. The cause of endometriosis is unknown and the accurate staging of the disease is subject to substantial errors. In this article, we focus on the 5 stagings of endometriosis: no endometriosis, stage I (minimal), stage II (mild), stage III (moderate) and stage IV (severe). In PRS, 79 subjects have complete staging results from the 8 physicians of interest and constitute the study sample. Among the 632(= 79 × 8) reviews, 155(25%) are no endometriosis while 250(40%), 136(21%), 63(10%) and 28(4%) are stages I to IV, respectively. Table 1 presents the averaged conditional sample proportions of endometriosis staging by one physician given the staging by another that are based on 10000 bootstrapped samples (drawn with replacement from the original data set) of the diagnostic results of two arbitrary physicians. As an indication of agreement, the kappa statistics is estimated to be 0.379. More specific substantive questions include (1) do the physicians have worse diagnostic accuracy at higher stages (moderate and severe) than at lower stages (no disease and minimal)? (2) are the extreme stages (no disease, minimal and severe) easier to diagnose than the middle stages (mild and moderate)? (3) how accurate are the physicians at correctly staging endometriosis? Off by only 1 stage? Off by 2 stages? (4) do the two groups of physicians (regional experts and residents) have different misclassification matrices in diagnosing endometriosis? From a statistical methodological perspective, we are interested in evaluating the robustness of our proposed crossed random effects model with respect to misspecification to the random effects distributions. The remainder of this article is organized as follow. In Section 2, we propose a latent class model with crossed random effects for estimating the diagnostic accuracy of ordinal tests for ordinal true disease status. In Section 3, we present a Monte-Carlo EM algorithm for parameter estimation. In Section 4, we perform simulation studies to assess the convergence of the maximum likelihood estimators (MLE) obtained from a Monte-Carlo EM algorithm and assess the operating characteristics of the procedure. In Section 5, we apply the proposed model to data from the PRS in the staging of endometriosis. Conclusions and discussions are provided in Section 6.

Methods

Let Y ij denote the ordinal diagnostic test result of the stages of endometriosis ( Y ij = 0, …, K ) for the i th subject by the j th rater (i.e. physician), i = 1, …, I and j = 1, …, J . In the PRS, there are four stages of endometriosis and we have 79 subjects who had complete diagnoses from four regional experts and four residents; thus K = 4, I = 79 and J = 8. We denote D i as the true disease status of the i th subject, where for each subject D i = 0 denotes absence of endometriosis and D i = 1, 2, 3 and 4 denote the four stages of endometriosis respectively. Investigators are interested in estimating the average accuracy across physician group rather than physician-specific accuracy. We parameterize the cumulative probability by the following model with two crossed random effects ( {b i }, {c j } ) and p covariates: (1) P ( Y i j ≤ k ∣ d i , b i , c j , X i j ) ≡ Φ ( γ d i , k + σ d i b i + τ d i c j + X i j β d i ) , where { b i } i = 1 I are subject-specific random effects with scale parameters σ ≡ { σ d i } d i = 0 D , { c j } j = 1 J are rater-specific random effects with scale parameters τ ≡ { τ d i } d i = 0 D , − ∞ = γ d i − 1 ≤ γ d i , 0 ≤ ⋯ ≤ γ d i , K − 1 ≤ γ d i , K = ∞ are monotonically nondecreasing cut-points for disease status d i , X i j is the i th row of the I × p matrix X j , which is the design matrix of the j th rater, and β d i is the d i -th column of β p × D , the matrix of coefficients for the covariates. Let γ = [ − ∞ − ∞ − ∞ − ∞ γ 0 , 0 γ 1 , 0 ⋯ γ D , 0 ⋮ ⋮ ⋱ ⋮ γ 0 , K − 1 γ 1 , K − 1 ⋯ γ D , K − 1 ∞ ∞ ∞ ∞ ] , be a matrix of cut-off points with rows corresponding to the rating and columns corresponding to the true unknown category. The model specifications above differ from that in Wang and Zhou [ 2 ] in that the rater-specific effect is assumed to be random in addition to the subject-specific effect. This is important since in PRS the physicians were chosen from a population of OBGYNs and it is of interest to obtain the average estimate across the population. The random effects { b i } i = 1 I and { c j } j = 1 J can each be assumed to follow a standard normal distribution. In addition, to allow for a more flexible random effects structure, we can also consider mixture of normals (MixNs). Define G b and G c to be random variables from a Bernoulli distribution with success probabilities λ b and λ c respectively, then (2) g b ( b ∣ G b ) = I ( G b = 1 ) ϕ ( b , μ 1 b , ν b 2 ) + I ( G b = 0 ) ϕ ( b , μ 2 b , ν b 2 ) ; g c ( c ∣ G c ) = I ( G c = 1 ) ϕ ( c , μ 1 c , ν c 2 ) + I ( G c = 0 ) ϕ ( c , μ 2 c , ν c 2 ) , where φ (•, μ, ν 2 ) is the density of normal distribution with mean μ and variance ν 2 . For model identification, we impose the restrictions on g b (•): E [ b ] = E [ E ( b ∣ G b ) ] = λ b μ 1 b + ( 1 − λ b ) μ 2 b = 0 with μ 1 b < μ 2 b ; v a r [ b ] = E [ v a r ( b ∣ G b ) ] + v a r [ E ( b ∣ G b ) ] = ν b 2 + λ b ( 1 − λ b ) ( μ 1 b 2 + μ 2 b 2 ) = 1 with similar restrictions for g c (•). Let π d i = P ( D i = d i ) and note that π d ( d = 0, 1, …, D ) is the prevalence of stage d of the disease. The probability that rater j rates subject i as k given the true disease stage and random effects b i and c j can be expressed as (3) P ( Y i j = k ∣ D i = d i , b i , c j , X i j ) = Φ ( γ d i , k + σ d i b i + τ d i c j + X i j β d i ) − Φ ( γ d i , k − 1 + σ d i b i + τ d i c j + X i j β d i ) . As in the PRS, we assume D = K = 4, which means that the number of rating categories (i.e. stages) is the same as the number of true categories. Similar to Wang et al [ 1 ], extensions to the case when D ≠ K can be considered. Given random effects { b i } i = 1 I and { c j } j = 1 J , the ratings are each conditionally independent and have a multinomial distribution. Thus, the conditional distribution of each measurement is ∏ k = 0 K P ( Y i j = k ∣ D i = d i , b i , c j , X i j ) I ( Y i j = k ) . Let θ = ( γ, σ, τ, π, β )’ be a vector of the unknown parameters. The likelihood is given by (4) L ( θ ∣ y ) = ∫ ⋯ ∫ ∑ d 1 = 0 D ⋯ ∑ d I = 0 D [ ∏ i = 1 I ∏ j = 1 J ∏ k = 0 K P ( Y i j = k ∣ D i = d i , b i , c j , X i j ) I ( Y i j = k ) ] × ∏ i = 1 I π d i g b ( b i ) d b i ∏ j = 1 J g c ( c j ) d c j . Once we obtain θ ^ which is the MLE of θ from ( 4 ), for covariate X = x , we obtain the MLE of P ( Y ~ ≤ k ∣ D ~ = d , b , c , X = x ) by ( 1 ), where we use Y ~ and D ~ to denote the generic test and generic disease status respectively rather than any specific rater or subject. We consider the misclassification matrix as a measure of diagnostic accuracy, where the ( K + 1) × ( D + 1) misclassification matrix has elements given by (5) P ( Y ~ = k ∣ D ~ = d , X = x ) ≡ ∫ ∫ P ( Y ~ = k ∣ D ~ = d , b , c , X = x ) × g b ( b ) g c ( c ) d b d c in the ( k + 1)th row and ( d + 1)th column of the matrix. The misclassification matrix is square ( D = K ) and it can be estimated by ( 5 ) once we obtain θ ^ . By the invariance property of MLE, we can estimate the probability of a correct classification as P ^ ( Y ~ = D ~ ∣ X = x ) = ∑ d = 0 D P ^ ( Y ~ = d ∣ D ~ = d , X = x ) π ^ d , where P ^ ( Y ~ = d ∣ D ~ = d , X = x ) is a function of θ ^ .

Discussion

In this article, we have described a method for evaluating diagnostic accuracy without gold standard by latent class approach with crossed random effects for ordinal tests. This work generalizes Wang and Zhou [ 2 ] to allow for the rater effect to be treated as random rather than fixed and for the incorporation of flexible random effect distributions. We show that estimate of diagnostic accuracy may be sensitive to assumptions on the random effects distributions and that unlike for the binary case (see Zhang et al [ 16 ]), penalized likelihood comparisons can be used for assessing the adequacy of these distributions. The proposed method was motivated by the Physician Reliability Study in staging endometriosis. To that end, we have found that it is important to incorporate covariates in the modeling framework to capture heterogeneity in raters. Substantively, we found that regional experts make better diagnoses than the residents when the true disease is mild, that physicians in PRS are more likely under-staging than over-staging, and that they are better at diagnosing lower stages of endometriosis than higher ones. These interesting findings can be useful in helping training physicians in the field. In the analysis of the PRS, the covariate is binary, making it easy to present the misclassification matrix for both realizations of the covariate. For more complicated case of covariates, we recommend plotting the elements of the misclassification matrix as a function of changes in these covariates. In particular, since we can write the misclassification matrix in a close-form as presented in the Appendix, we can plot each element of it against a continuous covariate. One feature in both our simulation study and the application is that the number of raters is small. With such small numbers of random effects, estimate of the corresponding variance component ( τ d ) can be biased. We indeed observed this bias in a separate simulation study with J = 10 and near unbiasedness with J = 100. For example, with the true τ 4 = 0.1, we obtained τ ^ 4 = 0.26 when J = 10, reflecting substantial bias, and τ ^ 4 = 0.11 when J = 100, reflecting little bias. Estimates of the subject-specific variance component in both cases are unbiased. Interestingly, regardless of J = 10 or 100, the estimated misclassification matrices, the primary focus of our work, are unbiased.

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: pmc-nxml ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Condition tags

endometriosis

MeSH descriptors

Data Interpretation, Statistical Diagnostic Tests, Routine Models, Statistical Algorithms Computer Simulation Diagnostic Tests, Routine Endometriosis Endometriosis Female Humans Monte Carlo Method Observer Variation Physicians Radiography Reproducibility of Results

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. The paper's references may be in our DB but unresolved to ``paper_id`` (resolution happens at ingest when the cited DOI matches a row we already have). Run the cross-source citation reconcile pass to retry.

Source provenance

europepmc: last seen: 2026-06-18T06:15:08.409253+00:00
pubmed: last seen: 2026-05-13T22:19:05.543671+00:00
unpaywall: last seen: 2026-05-14T19:30:52.867331+00:00

License: public-domain-us · commercial use OK · attribution required
Courtesy of the U.S. National Library of Medicine