Impact of Correlation on Concordance
During the article’s review process, the following question arose:
How can correlation between FAQ scores and neuropsychological measures inflate concordance estimates?
Introduction
Because this question is not straightforward to address conceptually, we conducted a brief simulation experiment to explore it empirically.
This vignette presents the R code and results of a simulation study comparing the algorithms’ concordance and PDD rate estimates in a simplified setting. In this scenario, the only differences between algorithms are the operationalisations of IADL deficit (FAQ > 7 vs. FAQ item 9 ≥ 1) and global cognitive deficit (MMSE < 26 vs. MoCA < 26), while all other diagnostic criteria are assumed to be satisfied.
Crucially, within algorithms, we vary the correlation between FAQ and MMSE/MoCA in the data-generating process, allowing us to compare estimates under two conditions: one with no correlation (the independent scenario) and one with correlations of the magnitude observed in our data (the correlated scenario; see Table 1).
Table 1: Observed moments (M, SD) and correlations of FAQ, MMSE and MoCA.

|      | M     | SD   | FAQ   | MMSE  | MoCA  |
|------|-------|------|-------|-------|-------|
| FAQ  | 4.05  | 4.89 | 1.00  | -0.21 | -0.27 |
| MMSE | 26.69 | 2.22 | -0.21 | 1.00  | 0.63  |
| MoCA | 24.07 | 3.48 | -0.27 | 0.63  | 1.00  |
Data simulation
Code
# Simulation parameters:
n <- 2000 # sample size high enough to get stable estimates
k <- 2000 # enough iterations to see trends
# Moments of the multivariate normal distribution:
mu <- c(FAQ = 4.05, MMSE = 26.69, MoCA = 24.07) # vector of means
sigma <- c(FAQ = 4.89, MMSE = 2.22, MoCA = 3.48) # vector of standard deviations
# Correlation matrix from observed data:
corrs <- matrix(
c(1, -0.21, -0.27,
-0.21, 1, 0.63,
-0.27, 0.63, 1),
nrow = 3,
dimnames = list(
c("FAQ", "MMSE", "MoCA"),
c("FAQ", "MMSE", "MoCA")
)
)
# Get variance-covariance matrix for the correlated case
# (cor2cov() as in e.g. lavaan or psych: scales a correlation matrix by the SDs):
sigma_corr <- cor2cov(corrs, sigma)
# Set FAQ/cognition covariances to zero for the independent case:
sigma_indep <- sigma_corr
sigma_indep["FAQ", ] <- c(sigma_indep["FAQ", "FAQ"], 0, 0)
sigma_indep[, "FAQ"] <- c(sigma_indep["FAQ", "FAQ"], 0, 0)
# Set censoring values (first column: lower bound 0, second column: upper bound 30):
cens <- matrix(
c(rep(0, 3), rep(30, 3)), nrow = 3,
dimnames = list(c("FAQ", "MMSE", "MoCA"))
)
# Set criteria (IADL deficit: FAQ total > 7 or FAQ item 9 >= 1;
# global cognitive deficit: MMSE < 26 or MoCA < 26):
crits <- data.frame(
  row.names = c("A1", "A2", "A3", "A4"),
  IADL = rep(c("FAQ", "FAQ9"), 2),
  IADL_thres = rep(c(7, 1), 2),
  cognition = c(rep("MMSE", 2), rep("MoCA", 2)),
  cognition_thres = rep(26, 4)
)
FAQ, MMSE & MoCA
To approximate the situation examined in the study, we simulate FAQ, MMSE and MoCA data from a common multivariate normal distribution. In other words, each simulated subject $i$ was assumed to have the following vector of latent scores:

$$\left(\text{FAQ}_i, \text{MMSE}_i, \text{MoCA}_i\right)^\top \sim \mathcal{N}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right) \tag{1}$$

where $\boldsymbol{\mu}$ denotes the vector of means and $\boldsymbol{\Sigma}$ the variance–covariance matrix. Across all simulations, the vector of means was fixed based on observed data (see Table 1):

$$\boldsymbol{\mu} = \left(4.05,\ 26.69,\ 24.07\right)^\top \tag{2}$$

Two different specifications of the variance–covariance matrix were used. The first, based on the observed data (Table 1), represents the correlated scenario:

$$\boldsymbol{\Sigma}_{\text{corr}} = \begin{pmatrix} 23.91 & -2.28 & -4.59 \\ -2.28 & 4.93 & 4.87 \\ -4.59 & 4.87 & 12.11 \end{pmatrix} \tag{3}$$

To simulate the independent scenario, the covariances between FAQ and MMSE/MoCA were set to zero, yielding:

$$\boldsymbol{\Sigma}_{\text{indep}} = \begin{pmatrix} 23.91 & 0 & 0 \\ 0 & 4.93 & 4.87 \\ 0 & 4.87 & 12.11 \end{pmatrix} \tag{4}$$
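For concreteness, a minimal sketch of this sampling step, assuming the MASS package is available (in the actual pipeline, the latent scores are drawn inside simulate_pdd_data(), which also wraps the item simulation and censoring steps described below):

# Draw n latent (FAQ, MMSE, MoCA) score vectors per scenario:
latent_corr  <- MASS::mvrnorm(n = n, mu = mu, Sigma = sigma_corr)  # correlated scenario
latent_indep <- MASS::mvrnorm(n = n, mu = mu, Sigma = sigma_indep) # independent scenario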
FAQ item 9
To simulate data for FAQ item 9, we assumed $\tau$-equivalence of the FAQ (i.e., that all items share the same true score, cf. Trizano-Hermosilla and Alvarado (2016)). We then applied a probit link to the standardised FAQ total score of each simulated patient to obtain the probability of a positive response to any single item:

$$p_i = \Phi\left(z_{\text{FAQ},i}\right) \tag{5}$$

where $\Phi$ is the standard normal cumulative distribution function and $z_{\text{FAQ},i}$ is the standardised FAQ score for simulated patient $i$. The observed FAQ item 9 score (ranging from 0 to 3) was subsequently drawn from a binomial distribution via:

$$\text{FAQ9}_i \sim \text{Binomial}\left(3,\ p_i\right) \tag{6}$$

If the simulated FAQ item 9 score exceeded the corresponding simulated FAQ total score, it was set to zero. Finally, all simulated FAQ total, MMSE and MoCA scores were left-censored at 0 and right-censored at 30 to reflect the real-world ranges of the respective scales.
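A minimal sketch of this item-generation step under the assumptions above (object names are illustrative and continue from the previous snippet; the correlated scenario is shown):

# Probit link on the standardised FAQ total (Equation 5):
z_faq <- (latent_corr[, "FAQ"] - mu["FAQ"]) / sigma["FAQ"]
p <- pnorm(z_faq)
# FAQ item 9, scored 0-3, drawn per Equation 6:
faq9 <- rbinom(n, size = 3, prob = p)
# Item 9 may not exceed the simulated total score:
faq9[faq9 > latent_corr[, "FAQ"]] <- 0
# Censor the total scores to the scales' observable ranges:
faq_total <- pmin(pmax(latent_corr[, "FAQ"], 0), 30)
mmse <- pmin(pmax(latent_corr[, "MMSE"], 0), 30)
moca <- pmin(pmax(latent_corr[, "MoCA"], 0), 30)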
Diagnostic algorithms
After generating the data using the process described above, each simulated patient was classified as suffering from PDD (or not) according to each of the following four algorithms:

$$
\begin{aligned}
\text{PDD}_{\text{A1},i} &= \mathbb{1}\left(\text{FAQ}_i > 7\right) \cdot \mathbb{1}\left(\text{MMSE}_i < 26\right) \\
\text{PDD}_{\text{A2},i} &= \mathbb{1}\left(\text{FAQ9}_i \geq 1\right) \cdot \mathbb{1}\left(\text{MMSE}_i < 26\right) \\
\text{PDD}_{\text{A3},i} &= \mathbb{1}\left(\text{FAQ}_i > 7\right) \cdot \mathbb{1}\left(\text{MoCA}_i < 26\right) \\
\text{PDD}_{\text{A4},i} &= \mathbb{1}\left(\text{FAQ9}_i \geq 1\right) \cdot \mathbb{1}\left(\text{MoCA}_i < 26\right)
\end{aligned}
\tag{7}
$$

where $\mathbb{1}(\cdot)$ denotes the indicator function. In other words, simulated cases fulfilling both conditions in parentheses were classified as having PDD ($\text{PDD} = 1$); all others were classified as non-PDD ($\text{PDD} = 0$).
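A minimal sketch of this classification step, continuing with the illustrative objects from the previous snippets (in the actual pipeline, simulate_pdd_data() performs this step based on the crits table defined above):

# Deficit indicators:
iadl_faq  <- faq_total > 7  # IADL deficit via FAQ total
iadl_faq9 <- faq9 >= 1      # IADL deficit via FAQ item 9
cog_mmse  <- mmse < 26      # global cognitive deficit via MMSE
cog_moca  <- moca < 26      # global cognitive deficit via MoCA
# The four diagnostic algorithms (Equation 7):
pdd <- data.frame(
  A1 = as.numeric(iadl_faq  & cog_mmse),
  A2 = as.numeric(iadl_faq9 & cog_mmse),
  A3 = as.numeric(iadl_faq  & cog_moca),
  A4 = as.numeric(iadl_faq9 & cog_moca)
)
colMeans(pdd) # PDD rate per algorithm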
Data
To ensure reliable results, N = 2000 cases were generated in this experiment. The simulation was based on the observed means, standard deviations and correlations as specified above. The independent variable in this experiment was presence versus absence of correlation between FAQ and cognitive scores. Accordingly, the simulation was conducted under two conditions:
- Independent scenario using $\boldsymbol{\Sigma}_{\text{indep}}$ (Equation 4),
- Correlated scenario using $\boldsymbol{\Sigma}_{\text{corr}}$ (Equation 3).
All other simulation parameters (i.e., the number of participants, the vector of variable means, the variances of all variables, and the covariance between MMSE and MoCA) were held constant across conditions.
Statistical analysis
Code
# Function for comparing two algorithms within the same scenario
# (a1, a2: names of columns holding the algorithms' 0/1 classifications):
compare_algos <- function(dat, a1, a2) {
x <- dat[[a1]]
y <- dat[[a2]]
cm <- caret::confusionMatrix(factor(x), factor(y), positive = "1")
k <- psych::cohen.kappa(cbind(x, y))$kappa
data.frame(
Accuracy = cm$overall["Accuracy"],
BalAcc = cm$byClass["Balanced Accuracy"],
Kappa = k
)
}
# Function for results summary (d0: independent-scenario data,
# d1: correlated-scenario data, r: replication index):
summarise_results <- function(d0, d1, r = 1) {
# PDD rates:
rates <- map_dfr(paste0("A", 1:4), function(i) {
c(indep = mean(d0[[i]]),
corrs = mean(d1[[i]])
)
}) |>
mutate(diff = corrs - indep, rep = r) |>
rownames_to_column("algo")
# Concordance:
conc <- expand.grid(
ref = paste0("A", 1:4),
pred = paste0("A", 1:4),
scenario = c("indep", "corrs")
)
for (i in c("Accuracy", "BalAcc", "Kappa")) {
conc[[i]] <- NA
}
for (i in seq_len(nrow(conc))) {
y <- as.character(conc[i, "ref"])
x <- as.character(conc[i, "pred"])
if (conc[i, "scenario"] == "indep") {
conc[i , c("Accuracy", "BalAcc", "Kappa")] <- compare_algos(d0, x, y)
} else if (conc[i, "scenario"] == "corrs") {
conc[i, c("Accuracy", "BalAcc", "Kappa")] <- compare_algos(d1, x, y)
}
}
conc$rep <- r
# Return the results:
list(rates = rates, concordance = as_tibble(conc))
}

After generating the data, the following statistical estimates were computed within each scenario (i.e., independent and correlated):
- PDD rate for each algorithm,
- accuracy for each pair of algorithms,
- balanced accuracy for each pair of algorithms,
- Cohen’s $\kappa$ for each pair of algorithms (see the definitions below).
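For reference, these indices are defined in the usual way. Writing $p_o$ for the observed proportion of agreement between two algorithms’ classifications and $p_e$ for the agreement expected by chance:

$$\text{Accuracy} = p_o, \qquad \text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$

with sensitivity and specificity computed by treating one algorithm’s classifications as the reference, as in caret::confusionMatrix() above.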
This procedure was repeated for 2000 iterations. Subsequently, the distributions of PDD rates and concordance measures were compared between the same (pairs of) algorithms across the two scenarios. Independent samples t-tests were used to compute 95% confidence intervals for the differences between scenarios within each diagnostic algorithm.
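For a single algorithm, this reduces to an ordinary two-sample t-test across the k iteration-level estimates. A minimal base-R equivalent of the rstatix call used below (res_rates is computed in the Results section):

# 95% CI for the difference in A1's PDD rate (correlated minus independent):
with(subset(res_rates, algo == "1"),
     t.test(corrs, indep, var.equal = TRUE)$conf.int)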
Results
Code
# Calculate all results:
res <- lapply(seq_len(k), function(i) {
d0 <- simulate_pdd_data(n, FALSE, mu, sigma_indep, crits, cens) |> # independent scenario
mutate(PDD = as.numeric(PDD)) |>
pivot_wider(names_from = "type", values_from = "PDD")
d1 <- simulate_pdd_data(n, FALSE, mu, sigma_corr, crits, cens) |> # correlated scenario
mutate(PDD = as.numeric(PDD)) |>
pivot_wider(names_from = "type", values_from = "PDD")
summarise_results(d0, d1, i)
})
# Extract PDD rates and concordance separately:
res_rates <- map_dfr(seq_len(k), \(i) res[[i]]$rates)
res_conc <- map_dfr(seq_len(k), \(i) res[[i]]$concordance)
# Conduct t-tests for PDD rates:
stats_rates <- res_rates |>
pivot_longer(c(indep, corrs), names_to = "scenario", values_to = "rate") |>
group_by(algo) |>
rstatix::t_test(rate ~ scenario, detailed = TRUE, var.equal = TRUE) |>
mutate(
Algorithm = glue::glue("A{algo}"),
Independent = estimate2,
Correlated = estimate1,
Difference = estimate,
Low = conf.low,
High = conf.high
) |>
dplyr::select(Algorithm, Independent, Correlated, Difference, Low, High)
# Conduct t-tests for concordance metrics:
stats_conc <- map_dfr(c("Accuracy", "BalAcc", "Kappa"), function(y) {
res_conc |>
filter(ref != pred) |>
mutate(Comparison = glue::glue("{ref}{pred}")) |>
group_by(Comparison) |>
rstatix::t_test(formula(glue::glue("{y} ~ scenario")), detailed = TRUE, var.equal = TRUE) |>
mutate(
Index = y,
Independent = estimate1,
Correlated = estimate2,
Difference = estimate,
Low = conf.low,
High = conf.high,
Type = case_when(
Comparison %in% c("A1A3", "A3A1", "A2A4", "A4A2") ~ "Aligned IADL & Mismatched Global Cognition",
Comparison %in% c("A1A2", "A2A1", "A3A4", "A4A3") ~ "Mismatched IADL & Aligned Global Cognition",
Comparison %in% c("A1A4", "A4A1", "A2A3", "A3A2") ~ "Mismatched IADL & Mismatched Global Cognition"
)
) |>
dplyr::select(Index, Type, Comparison, Independent, Correlated, Difference, Low, High)
})

The expected values (denoted $E$) and the differences between the independent and correlated scenarios in PDD rates, accuracy, balanced accuracy and Cohen’s $\kappa$ are presented in Table 2, Table 3, Table 4, and Table 5, respectively.
Across all algorithms, the correlated scenario generated higher PDD rates than the independent scenario, with differences of approximately 2-4 percentage points (Table 2).
Table 2: Expected PDD rates and their differences between scenarios (Difference = Correlated − Independent).

| Algorithm | E(PDD Rate): Independent | E(PDD Rate): Correlated | Difference | 95% CI: Low | 95% CI: High |
|---|---|---|---|---|---|
| A1 | 0.080 | 0.103 | 0.023 | 0.023 | 0.024 |
| A2 | 0.201 | 0.231 | 0.030 | 0.029 | 0.031 |
| A3 | 0.149 | 0.174 | 0.025 | 0.024 | 0.025 |
| A4 | 0.377 | 0.413 | 0.035 | 0.035 | 0.036 |
The effect of the data-generating scenario on the concordance indices was more nuanced. For accuracy, the correlated scenario generally yielded slightly lower estimates than the independent one across most algorithm pairs (Table 3). However, the magnitude of these differences did not exceed 2 percentage points.
Table 3: Expected accuracy per algorithm pair and scenario (Difference = Independent − Correlated).

| Comparison | E(Accuracy): Independent | E(Accuracy): Correlated | Difference | 95% CI: Low | 95% CI: High |
|---|---|---|---|---|---|
| **Mismatched IADL & Aligned Global Cognition** | | | | | |
| A1A2 | 0.878 | 0.871 | 0.007 | 0.006 | 0.007 |
| A2A1 | 0.878 | 0.871 | 0.007 | 0.006 | 0.007 |
| A3A4 | 0.770 | 0.759 | 0.011 | 0.010 | 0.011 |
| A4A3 | 0.770 | 0.759 | 0.011 | 0.010 | 0.011 |
| **Aligned IADL & Mismatched Global Cognition** | | | | | |
| A1A3 | 0.918 | 0.920 | −0.002 | −0.002 | −0.001 |
| A2A4 | 0.792 | 0.790 | 0.001 | 0.001 | 0.002 |
| A3A1 | 0.918 | 0.920 | −0.002 | −0.002 | −0.001 |
| A4A2 | 0.792 | 0.790 | 0.001 | 0.001 | 0.002 |
| **Mismatched IADL & Mismatched Global Cognition** | | | | | |
| A1A4 | 0.689 | 0.680 | 0.009 | 0.008 | 0.010 |
| A2A3 | 0.795 | 0.790 | 0.005 | 0.005 | 0.006 |
| A3A2 | 0.795 | 0.790 | 0.005 | 0.005 | 0.006 |
| A4A1 | 0.689 | 0.680 | 0.009 | 0.008 | 0.010 |
Results for balanced accuracy were more heterogeneous (Table 4). In algorithm pairs where both the IADL and the global cognition criteria were mismatched (e.g., A1 vs A4), balanced accuracy was consistently higher under the correlated scenario, by up to roughly 3 percentage points.
When the only mismatch concerned the IADL criterion, the direction of the difference in balanced accuracy reversed depending on which algorithm served as the reference (e.g., the A1A2 comparison yielded lower balanced accuracy under the correlated scenario, whereas the reverse comparison, A2A1, yielded higher values). The effect was larger in the comparisons where the correlated scenario produced the higher estimates.
Finally, when the only mismatch involved the global cognition criterion, the correlated scenario led to higher balanced accuracy estimates in all four comparisons, although the difference for A2A4 was negligible. The largest difference was comparable to the other comparison types, reaching 3.8 percentage points (A3A1).
Table 4: Expected balanced accuracy per algorithm pair and scenario (Difference = Independent − Correlated).

| Comparison | E(Balanced accuracy): Independent | E(Balanced accuracy): Correlated | Difference | 95% CI: Low | 95% CI: High |
|---|---|---|---|---|---|
| **Mismatched IADL & Aligned Global Cognition** | | | | | |
| A1A2 | 0.930 | 0.925 | 0.005 | 0.005 | 0.006 |
| A2A1 | 0.696 | 0.721 | −0.025 | −0.025 | −0.024 |
| A3A4 | 0.861 | 0.851 | 0.010 | 0.010 | 0.011 |
| A4A3 | 0.695 | 0.708 | −0.013 | −0.014 | −0.012 |
| **Aligned IADL & Mismatched Global Cognition** | | | | | |
| A1A3 | 0.919 | 0.935 | −0.016 | −0.016 | −0.015 |
| A2A4 | 0.840 | 0.842 | −0.002 | −0.003 | −0.002 |
| A3A1 | 0.742 | 0.779 | −0.038 | −0.039 | −0.037 |
| A4A2 | 0.732 | 0.751 | −0.019 | −0.019 | −0.018 |
| **Mismatched IADL & Mismatched Global Cognition** | | | | | |
| A1A4 | 0.791 | 0.798 | −0.007 | −0.007 | −0.006 |
| A2A3 | 0.633 | 0.662 | −0.028 | −0.029 | −0.028 |
| A3A2 | 0.668 | 0.700 | −0.031 | −0.032 | −0.030 |
| A4A1 | 0.591 | 0.614 | −0.023 | −0.023 | −0.022 |
Lastly, Cohen’s $\kappa$ showed the clearest and most consistent pattern of results (Table 5). In all cases, the correlated scenario yielded higher values than the independent one, with absolute differences ranging from 0.012 to 0.068.
Table 5: Expected Cohen’s κ per algorithm pair and scenario (Difference = Independent − Correlated).

| Comparison | E(Cohen’s κ): Independent | E(Cohen’s κ): Correlated | Difference | 95% CI: Low | 95% CI: High |
|---|---|---|---|---|---|
| **Mismatched IADL & Aligned Global Cognition** | | | | | |
| A1A2 | 0.507 | 0.548 | −0.041 | −0.043 | −0.040 |
| A2A1 | 0.507 | 0.548 | −0.041 | −0.043 | −0.040 |
| A3A4 | 0.443 | 0.456 | −0.012 | −0.014 | −0.011 |
| A4A3 | 0.443 | 0.456 | −0.012 | −0.014 | −0.011 |
| **Aligned IADL & Mismatched Global Cognition** | | | | | |
| A1A3 | 0.598 | 0.666 | −0.068 | −0.069 | −0.066 |
| A2A4 | 0.511 | 0.537 | −0.025 | −0.026 | −0.024 |
| A3A1 | 0.598 | 0.666 | −0.068 | −0.069 | −0.066 |
| A4A2 | 0.511 | 0.537 | −0.025 | −0.026 | −0.024 |
| **Mismatched IADL & Mismatched Global Cognition** | | | | | |
| A1A4 | 0.215 | 0.256 | −0.041 | −0.042 | −0.040 |
| A2A3 | 0.294 | 0.353 | −0.059 | −0.061 | −0.057 |
| A3A2 | 0.294 | 0.353 | −0.059 | −0.061 | −0.057 |
| A4A1 | 0.215 | 0.256 | −0.041 | −0.042 | −0.040 |
Conclusions
Based on the simulations presented in this vignette, we can conclude that correlations between FAQ and global cognition measures of the magnitude observed in our empirical data imply that:
- the correlated scenario is more sensitive to PDD, consistently yielding higher PDD rates,
- the correlated scenario produces accuracy estimates that are similar to those from the independent scenario,
- when considering balanced accuracy, the correlated scenario diverges from the independent one in more nuanced ways, yielding higher estimates in some comparisons and lower ones in others,
- relative to the independent scenario, the correlated scenario tends to slightly overestimate Cohen’s $\kappa$,¹ and
- overall, the differences between the correlated and independent scenarios are likely to be small in magnitude, unless higher correlations between diagnostic criteria are introduced in the data-generating process.²