Author
Conceptualization/investigation, M.P. and T.T.O.; data curation, C.W. and T.T.O.; formal analysis, M.P.; funding acquisition, B.G., D.K.S., L.C.G., and M.S.; methodology/validation, M.P. and G.W.K.; resources/software, M.P.; project administration/supervision, T.T.O.; visualization, M.P., S.H.B., B.O. and T.T.O.; writing – original draft, M.P. and T.T.O.; writing – review and editing, M.P., C.W., M.D., S.H.B., B.O., B.R.D., K.C.V., B.L., J.C.I., A.J.C., B.G., J.L., D.K.S., G.W.K., L.C.G., M.S., and T.T.O.
Results
The standard algorithm anticlust uses to optimize balance among batches is an exchange method that consists of two steps. 11 , 22 First, it randomly assigns samples to batches, while adhering to cardinality constraints imposed by the researcher. Our algorithm allows any predefined number of samples per batch, whereas the most common requirement is that each batch must have the same number of samples. The initialization step is followed by an optimization process that systematically exchanges samples between batches. Each exchange is chosen in such a way that it leads to the largest possible improvement in balance (or similarity) among batches. Similarity among batches is quantified through a mathematical objective function that is known from cluster analysis; in cluster analysis, however, samples would be assigned in such a way that the objective function is minimal, while in anticlustering it has to be maximal. We used the popular diversity objective, which is based on a mathematical notion of pairwise dissimilarity among individual samples. 12 In particular, the diversity is computed as the sum of all distances among samples from the same batch (see Figure 1 ). As the basic distance measure for computing the diversity, we used the Euclidean or squared Euclidean distance based on features (e.g., age, body mass index [BMI]) of the individuals from whom the samples were obtained. Via binary encoding, the diversity can also incorporate categorical variables, such as race or disease stage, which are often considered as parameters when striving for balance among batches. 8 , 9 To ensure comparable weight of all features when computing the diversity, the input variables should be standardized before computing the pairwise distances. Standardization is particularly useful if the variables differ strongly in their ranges. 23 Figure 1 Illustration of the diversity objective for anticlustering and cluster analysis (A) Twelve hypothetical values of BMI and age in a scatterplot. (B) Euclidean distances between features as a straight line in the two-dimensional space. The Euclidean distance is proportional to the length of the connecting lines in (B). (C) A clustering assignment of the 12 data points to K = 3 equal-sized groups via minimum diversity. (D) An anticlustering assignment of the 12 data points to K = 3 equal-sized groups via maximum diversity. The diversity is computed as the sum of within-cluster distances, which are highlighted in (C) and (D) through connecting lines. Maximizing the Euclidean diversity simultaneously leads to a similar distribution of the input features among batches (for equally sized, balanced batches). Representation of the conversion from numeric features to Euclidean distance and (anti)clustering assignments based on minimum and maximum diversity using the Euclidean distance.
Illustration of the diversity objective for anticlustering and cluster analysis
(A) Twelve hypothetical values of BMI and age in a scatterplot.
(B) Euclidean distances between features as a straight line in the two-dimensional space. The Euclidean distance is proportional to the length of the connecting lines in (B).
(C) A clustering assignment of the 12 data points to K = 3 equal-sized groups via minimum diversity.
(D) An anticlustering assignment of the 12 data points to K = 3 equal-sized groups via maximum diversity. The diversity is computed as the sum of within-cluster distances, which are highlighted in (C) and (D) through connecting lines. Maximizing the Euclidean diversity simultaneously leads to a similar distribution of the input features among batches (for equally sized, balanced batches). Representation of the conversion from numeric features to Euclidean distance and (anti)clustering assignments based on minimum and maximum diversity using the Euclidean distance.
To add must-link constraints to anticlustering, we proposed a new algorithm called Two-Phase Must Link (2PML). Phase 1 is a straightforward adaptation of the standard algorithm for unconstrained anticlustering. To initialize phase 1, we enforced the must-link constraints through a randomized bin packing algorithm. During the following optimization step, we used a modified representation of the dataset in which all samples that must be linked together are treated as a single unit: a must-link clique . Using the modified dataset, the optimization step applies the same exchange process as the unconstrained algorithm, with the exception that exchanges are restricted to cliques of the same size, thus ensuring that batches remain filled according to the cardinality requirements (e.g., equal-sized batches). Phase 2 drops the restriction of performing exchanges between cliques of the same size and systematically improves over the assignment found in phase 1. The STAR Methods section describes both phases of 2PML in detail.
To thoroughly evaluate the usefulness of anticlustering for batch assignment, we compared it to two alternative approaches in a large-scale simulation study: (1) the OSAT method by Yan et al., 8 which already has been used frequently for batch assignment, 24 , 25 , 26 and (2) PS batch assignment (PSBA) by Carry et al., 9 which has been proposed more recently and has not yet been cited by other researchers. As a secondary contribution of the simulation study, we investigated whether the quality of batch assignment is reduced when must-link constraints are included with anticlustering as compared to an unconstrained anticlustering assignment. The simulation was implemented using the R programming language (version 4.4.2 27 on an Intel i9-12900K computer (5.100 GHz × 24) with 32 GB RAM, running Ubuntu 22.04.5 LTS.
To obtain a fair comparison to the alternative approaches, anticlustering was implemented using the default settings of the anticlust package (version 0.8.9–1) that—despite numerous updates—has remained unchanged since its original presentation. 11 We optimized the diversity objective via the exchange method, using the Euclidean distance as the measure of pairwise dissimilarity. Anticlustering subject to must-link constraints used the 2PML algorithm, applying one iteration of phase 1 and one iteration of phase 2, respectively. For OSAT, we used the implementation that is freely available as an R package from Bioconductor (version 1.52.0). 28 We used the default OSAT algorithm optimal shuffle with 5,000 repetitions. We also evaluated the alternative algorithm optimal block , but decided not to include it in the final large-scale simulation, because it provided comparable results while running at a significantly slower pace. Carry et al. 9 provided an R implementation of PSBA that is available from an accompanying online repository ( https://github.com/carryp/PS-Batch-Effect ). In the original publication, the authors presented a version of PSBA that investigates all possible batch assignments and selects the one that minimizes discrepancy in average PSs among batches. However, investigating all possible assignments is only feasible for small datasets and, thus, not generally applicable. In the online repository, the authors therefore included an implementation that randomly generates a user-defined number of batch assignments and selects the best one. For comparability with OSAT, we set the number of random assignments to 5,000.
For the simulation, we generated 10,000 datasets. Datasets were processed via (1) OSAT, (2) PSBA, (3) unconstrained anticlustering, and (4) anticlustering subject to must-link constraints. Because the OSAT method is only applicable to categorical variables, we generated categorical variables in our simulation. For anticlustering and PSBA, the categorical variables were binary-coded before the methods were applied. OSAT directly used the categorical variables as input because its objective function is computed using category frequencies. For each dataset, we randomly determined the number of categorical variables M (2–5), the number of classes per variable P (2–5), the total sample size n (between 50 and 500), and the number of batches K (2, 4, or 10 equal-sized batches). PSBA was only applied for K = 2 and K = 4 batches because the authors’ implementation only allows the assignment to a maximum of four batches. For each dataset, we also simulated dropouts in samples due to technical failures. To this end, 20% of samples were randomly discarded after the assignment was conducted, to test the methods’ ability to preserve balance when some samples cannot be used. For constrained anticlustering, must-link constraints were generated by creating a random integer (between 1 and n ) for each of the n samples, which served as an ID variable; that is, if two samples were randomly assigned the same ID, they were required to be assigned to the same batch. This procedure resulted in a distribution of constraints that was similar to the distribution of constraints in our motivating application: 58% of all samples had no must-link partner, 29% had one must-link partner, 10% had two must-link partners, and 3% had ≥3 must-link partners.
Across the 10,000 simulation runs, the average run time for the competing assignment methods was 0.09 s for unconstrained anticlustering, 0.10 s for anticlustering with the must-link feature, 2.68 s for OSAT, and 27.22 s for PSBA, making anticlustering about 30 and 302 times faster than OSAT and PSBA, respectively. For each simulation run for each of the four methods, we computed a χ 2 test to assess the balance among batches for each of the 2–5 variables, as performed by Yan et al. 8 In this context, a higher p value indicates that there is more balance among batches, i.e., the batches are more similar with regard to the covariates. For 84% of all variables that were generated during the simulation, balance among batches was better when using anticlustering compared to OSAT. Balance was equal in 15% of all comparisons, and only in 0.32% (=110 of all 34,880 pairwise comparisons), OSAT outperformed anticlust. For 78% of all variables that were generated during the simulation, balance was better when using anticlustering compared to PSBA. Balance was equal in 22% of all comparisons, and only in 0.15% (=35 of all 23258 pairwise comparisons), PSBA outperformed anticlustering. Figure 2 illustrates average p values in dependence on the number of variables and the number of batches, when no dropouts occurred or when 20% dropouts occurred. Without dropouts, when increasing the number of variables from 2 to 5, the average p value for OSAT declined from 0.99 to 0.71, whereas the average p value for anticlustering remained >0.99. PSBA also demonstrated a decrease in p value when increasing the number of variables, but less so than OSAT (from 0.99 to 0.91). With 20% dropouts, the overall level of balance was decreased compared to the condition where no dropouts occurred, especially for K = 2 and K = 4. Still, anticlustering maintained the highest level of balance among the competing methods, with average p values never dropping below 0.85 (see Figure 2 ). The worst-case p value observed for a single dataset under dropout conditions was p = 0.1043 for anticlustering. The other methods were more strongly affected by data loss, as evident by much smaller worst-case p values ( p = 0.0001 for OSAT and p = 0.0033 for PSBA). In the online supplemental information , we provide figures illustrating the simulation results while aggregating across the other variables that varied in the simulation (see Figure S1 ). Figure 2 Results of the simulation study by assignment method, number of batches, and number of variables per dataset Average p values from χ 2 tests, based on the number of batches and the number of variables. Higher p values indicate higher balance. Anticlustering maintained a comparable level of balance across all conditions, especially when no dropouts occurred. OSAT’s performance decreased the most with an increasing number of variables.
Results of the simulation study by assignment method, number of batches, and number of variables per dataset
Average p values from χ 2 tests, based on the number of batches and the number of variables. Higher p values indicate higher balance. Anticlustering maintained a comparable level of balance across all conditions, especially when no dropouts occurred. OSAT’s performance decreased the most with an increasing number of variables.
Remarkably, the constrained anticlustering assignment led to better balance than the OSAT and PSBA methods that did not employ any constraints (see Figure 2 ). In general, using must-link constraints hardly affected the balance among batches; on average, the anticlustering assignment subjected to must-link constraints achieved 99.9% of the objective value of the unconstrained assignment. In 74% of all cases, balance was not at all reduced by the must-link constraints. Hence, must-link constraints are not only valuable to researchers but also do not considerably compromise batch balance.
The ENACT center’s work is concerned with researching endometriosis. One of the ENACT projects required assigning 320 tissue samples of women who either have the disease (i.e., case) or not (i.e., controls) to 20 batches (16 samples per batch). To illustrate the application, we prepared a synthetic dataset that resembles the actual dataset, which is not disclosed for reasons of medical confidentiality. In the synthetic dataset, all women who did not have the disease ( n = 50) provided exactly one sample, and all women who have the disease ( n = 89) provided multiple samples: 40 patients provided 2 samples, 25 patients provided 3 samples, 11 patients provided 4 samples, 10 patients provided 5 samples, and 6, 7, or 8 samples were provided by one patient, respectively. In total, the synthetic dataset consisted of 320 samples from 139 unique individuals.
For the assignment, we sought balance with regard to four categorical variables: disease presence (“yes” = case sample; “no” = control sample), disease stage (“none,” “stage I or II,” and “stage III or IV”), clinical site (“UCSF” or other), menstrual cycle phase (proliferative [“PE”] and secretory [“SE”]). Demographic variables (e.g., age and BMI) were also assessed, but not deemed critical covariates for the assignment. However, post-hoc checks revealed that no significant discrepancies occurred in these other variables, even though they were not explicitly considered via anticlustering.
In our actual experiment, we only used the results of the must-link restricted assignment. However, for the purpose of illustration, we also applied unconstrained anticlustering and OSAT to show what level of balance can be achieved when no constraints are imposed. Tables 1 , 2 , and 3 illustrate the distribution of the categorical variables across batches for unconstrained anticlustering, OSAT, and constrained anticlustering, respectively. The tables were generated using the R package tableone. 29 The table cells contain both information about frequencies and percentages (in parentheses), whether a batch represents only individuals whose samples are unique to that batch (“true” or “false”), and p values from χ 2 tests for the comparison of the four relevant covariates across the batches. The unconstrained methods successfully balanced all four variables among batches (all p values > 0.999 for anticlustering and all p values > 0.99 for OSAT). While all assignments were conducted using the 320 individual samples as input—as opposed to the 139 individual patients—we also verified the satisfactory balance on the level of individuals (see the upper rows in Tables 1 , 2 , and 3 ). Table 3 illustrates the level of balance achieved by constrained anticlustering, which ensured that samples belonging to the same patient were assigned to the same batch. Note that in this application, the must-link constraints put rather severe restrictions on the assignments that were possible; for example, up to 50% of a batch had to be occupied with the samples of a single patient. Still, batches turned out to be highly balanced when including them via the 2PML algorithm (all p values > 0.99). Remarkably, our method filled each of the 20 batches with at least one of the 21 individuals with disease phase I or II (see section Individuals in Table 3 ). Table 1 Summary statistics on an individual- and sample-level for the batches assigned using the unrestricted anticlustering method, with p values from χ 2 tests Batch number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 p -value Individuals n 16 15 16 14 16 15 14 16 15 12 16 16 16 15 16 14 14 16 16 16 Endo = yes (%) 14 (87.5) 12 (80.0) 14 (87.5) 12 (85.7) 13 (81.2) 13 (86.7) 12 (85.7) 14 (87.5) 13 (86.7) 9 (75.0) 13 (81.2) 14 (87.5) 13 (81.2) 12 (80.0) 13 (81.2) 12 (85.7) 11 (78.6) 13 (81.2) 14 (87.5) 13 (81.2) 1.000 Site = UC (%) 4 (25.0) 3 (20.0) 3 (18.8) 3 (21.4) 3 (18.8) 3 (20.0) 2 (14.3) 3 (18.8) 3 (20.0) 3 (25.0) 4 (25.0) 3 (18.8) 4 (25.0) 3 (20.0) 3 (18.8) 3 (21.4) 3 (21.4) 4 (25.0) 3 (18.8) 4 (25.0) 1.000 Phase = SE (%) 7 (43.8) 6 (40.0) 6 (37.5) 5 (35.7) 7 (43.8) 7 (46.7) 6 (42.9) 7 (43.8) 6 (40.0) 5 (41.7) 7 (43.8) 6 (37.5) 7 (43.8) 7 (46.7) 6 (37.5) 6 (42.9) 5 (35.7) 7 (43.8) 6 (37.5) 7 (43.8) 1.000 Stage (%) 1.000 None 2 (12.5) 3 (20.0) 2 (12.5) 2 (14.3) 3 (18.8) 2 (13.3) 2 (14.3) 2 (12.5) 2 (13.3) 3 (25.0) 3 (18.8) 2 (12.5) 3 (18.8) 3 (20.0) 3 (18.8) 2 (14.3) 3 (21.4) 3 (18.8) 2 (12.5) 3 (18.8) Stages I and II 4 (25.0) 3 (20.0) 4 (25.0) 4 (28.6) 3 (18.8) 4 (26.7) 3 (21.4) 4 (25.0) 4 (26.7) 2 (16.7) 3 (18.8) 4 (25.0) 3 (18.8) 2 (13.3) 3 (18.8) 3 (21.4) 3 (21.4) 3 (18.8) 4 (25.0) 3 (18.8) Stages III and IV 10 (62.5) 9 (60.0) 10 (62.5) 8 (57.1) 10 (62.5) 9 (60.0) 9 (64.3) 10 (62.5) 9 (60.0) 7 (58.3) 10 (62.5) 10 (62.5) 10 (62.5) 10 (66.7) 10 (62.5) 9 (64.3) 8 (57.1) 10 (62.5) 10 (62.5) 10 (62.5) Batch only has individuals unique to this batch false false false false false false false false false false false false false false FALSE FALSE FALSE FALSE FALSE FALSE N/A Samples n 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 N/A Endo = yes (%) 14 (87.5) 13 (81.2) 14 (87.5) 14 (87.5) 13 (81.2) 14 (87.5) 14 (87.5) 14 (87.5) 14 (87.5) 13 (81.2) 13 (81.2) 14 (87.5) 13 (81.2) 13 (81.2) 13 (81.2) 14 (87.5) 13 (81.2) 13 (81.2) 14 (87.5) 13 (81.2) 1.000 Site = UC (%) 4 (25.0) 3 (18.8) 3 (18.8) 3 (18.8) 3 (18.8) 3 (18.8) 3 (18.8) 3 (18.8) 3 (18.8) 3 (18.8) 4 (25.0) 3 (18.8) 4 (25.0) 3 (18.8) 3 (18.8) 3 (18.8) 4 (25.0) 4 (25.0) 3 (18.8) 4 (25.0) 1.000 Phase = SE (%) 7 (43.8) 6 (37.5) 6 (37.5) 6 (37.5) 7 (43.8) 7 (43.8) 6 (37.5) 7 (43.8) 6 (37.5) 6 (37.5) 7 (43.8) 6 (37.5) 7 (43.8) 7 (43.8) 6 (37.5) 6 (37.5) 6 (37.5) 7 (43.8) 6 (37.5) 7 (43.8) 1.000 Stage (%) 1.000 None 2 (12.5) 3 (18.8) 2 (12.5) 2 (12.5) 3 (18.8) 2 (12.5) 2 (12.5) 2 (12.5) 2 (12.5) 3 (18.8) 3 (18.8) 2 (12.5) 3 (18.8) 3 (18.8) 3 (18.8) 2 (12.5) 3 (18.8) 3 (18.8) 2 (12.5) 3 (18.8) Stages I and II 4 (25.0) 3 (18.8) 4 (25.0) 4 (25.0) 3 (18.8) 4 (25.0) 4 (25.0) 4 (25.0) 4 (25.0) 3 (18.8) 3 (18.8) 4 (25.0) 3 (18.8) 3 (18.8) 3 (18.8) 4 (25.0) 3 (18.8) 3 (18.8) 4 (25.0) 3 (18.8) Stages III and IV 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) 10 (62.5) Table 2 Summary statistics on an individual- and sample-level for the batches assigned using the OSAT method, with p values from χ 2 tests Batch number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 p value Individuals n 15 16 16 16 16 16 15 16 16 16 15 16 16 15 15 15 16 14 16 16 N/A Endo = yes (%) 11 (73.3) 14 (87.5) 14 (87.5) 14 (87.5) 14 (87.5) 14 (87.5) 11 (73.3) 14 (87.5) 15 (93.8) 12 (75.0) 12 (80.0) 14 (87.5) 14 (87.5) 13 (86.7) 12 (80.0) 13 (86.7) 13 (81.2) 11 (78.6) 14 (87.5) 13 (81.2) 0.995 Site = UC (%) 3 (20.0) 3 (18.8) 4 (25.0) 4 (25.0) 4 (25.0) 5 (31.2) 3 (20.0) 3 (18.8) 2 (12.5) 4 (25.0) 3 (20.0) 3 (18.8) 4 (25.0) 3 (20.0) 1 (6.7) 3 (20.0) 5 (31.2) 2 (14.3) 4 (25.0) 3 (18.8) 0.998 Phase = SE (%) 6 (40.0) 7 (43.8) 8 (50.0) 6 (37.5) 5 (31.2) 7 (43.8) 7 (46.7) 7 (43.8) 7 (43.8) 6 (37.5) 6 (40.0) 6 (37.5) 7 (43.8) 5 (33.3) 6 (40.0) 6 (40.0) 6 (37.5) 6 (42.9) 6 (37.5) 6 (37.5) 1.000 Stage (%) 1.000 None 4 (26.7) 2 (12.5) 2 (12.5) 2 (12.5) 2 (12.5) 2 (12.5) 4 (26.7) 2 (12.5) 1 (6.2) 4 (25.0) 3 (20.0) 2 (12.5) 2 (12.5) 2 (13.3) 3 (20.0) 2 (13.3) 3 (18.8) 3 (21.4) 2 (12.5) 3 (18.8) Stages I and II 3 (20.0) 3 (18.8) 3 (18.8) 4 (25.0) 4 (25.0) 3 (18.8) 2 (13.3) 5 (31.2) 4 (25.0) 2 (12.5) 3 (20.0) 3 (18.8) 4 (25.0) 5 (33.3) 4 (26.7) 3 (20.0) 3 (18.8) 4 (28.6) 4 (25.0) 4 (25.0) Stages III and IV 8 (53.3) 11 (68.8) 11 (68.8) 10 (62.5) 10 (62.5) 11 (68.8) 9 (60.0) 9 (56.2) 11 (68.8) 10 (62.5) 9 (60.0) 11 (68.8) 10 (62.5) 8 (53.3) 8 (53.3) 10 (66.7) 10 (62.5) 7 (50.0) 10 (62.5) 9 (56.2) Batch only has individuals unique to this batch false false false false false false false false false false false false false false false false false false false false N/A Samples n 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 N/A Endo = yes (%) 12 (75.0) 14 (87.5) 14 (87.5) 14 (87.5) 14 (87.5) 14 (87.5) 12 (75.0) 14 (87.5) 15 (93.8) 12 (75.0) 13 (81.2) 14 (87.5) 14 (87.5) 14 (87.5) 13 (81.2) 14 (87.5) 13 (81.2) 13 (81.2) 14 (87.5) 13 (81.2) 0.998 Site = UC (%) 3 (18.8) 3 (18.8) 4 (25.0) 4 (25.0) 4 (25.0) 5 (31.2) 3 (18.8) 3 (18.8) 2 (12.5) 4 (25.0) 3 (18.8) 3 (18.8) 4 (25.0) 3 (18.8) 1 (6.2) 3 (18.8) 5 (31.2) 2 (12.5) 4 (25.0) 3 (18.8) 0.994 Phase = SE (%) 7 (43.8) 7 (43.8) 8 (50.0) 6 (37.5) 5 (31.2) 7 (43.8) 7 (43.8) 7 (43.8) 7 (43.8) 6 (37.5) 6 (37.5) 6 (37.5) 7 (43.8) 5 (31.2) 6 (37.5) 7 (43.8) 6 (37.5) 7 (43.8) 6 (37.5) 6 (37.5) 1.000 Stage (%) 1.000 None 4 (25.0) 2 (12.5) 2 (12.5) 2 (12.5) 2 (12.5) 2 (12.5) 4 (25.0) 2 (12.5) 1 (6.2) 4 (25.0) 3 (18.8) 2 (12.5) 2 (12.5) 2 (12.5) 3 (18.8) 2 (12.5) 3 (18.8) 3 (18.8) 2 (12.5) 3 (18.8) Stages I and II 3 (18.8) 3 (18.8) 3 (18.8) 4 (25.0) 4 (25.0) 3 (18.8) 2 (12.5) 5 (31.2) 4 (25.0) 2 (12.5) 3 (18.8) 3 (18.8) 4 (25.0) 5 (31.2) 4 (25.0) 3 (18.8) 3 (18.8) 4 (25.0) 4 (25.0) 4 (25.0) Stages III and IV 9 (56.2) 11 (68.8) 11 (68.8) 10 (62.5) 10 (62.5) 11 (68.8) 10 (62.5) 9 (56.2) 11 (68.8) 10 (62.5) 10 (62.5) 11 (68.8) 10 (62.5) 9 (56.2) 9 (56.2) 11 (68.8) 10 (62.5) 9 (56.2) 10 (62.5) 9 (56.2) Table 3 Summary statistics on an individual- and sample-level for the batches assigned using the anticlustering method with the must-link feature, with p values from χ 2 tests Batch number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 P value Individuals n 7 5 7 9 6 8 6 7 7 5 8 5 8 9 8 7 6 7 6 8 N/A Endo = yes (%) 4 (57.1) 3 (60.0) 5 (71.4) 6 (66.7) 4 (66.7) 5 (62.5) 4 (66.7) 5 (71.4) 4 (57.1) 3 (60.0) 5 (62.5) 3 (60.0) 5 (62.5) 6 (66.7) 5 (62.5) 5 (71.4) 4 (66.7) 4 (57.1) 4 (66.7) 5 (62.5) 1.000 Site = UC (%) 3 (42.9) 2 (40.0) 1 (14.3) 2 (22.2) 2 (33.3) 3 (37.5) 2 (33.3) 1 (14.3) 3 (42.9) 2 (40.0) 3 (37.5) 1 (20.0) 3 (37.5) 3 (33.3) 2 (25.0) 2 (28.6) 2 (33.3) 1 (14.3) 2 (33.3) 2 (25.0) 0.999 Phase = SE (%) 3 (42.9) 1 (20.0) 2 (28.6) 3 (33.3) 2 (33.3) 4 (50.0) 3 (50.0) 3 (42.9) 2 (28.6) 2 (40.0) 4 (50.0) 2 (40.0) 3 (37.5) 4 (44.4) 5 (62.5) 2 (28.6) 2 (33.3) 2 (28.6) 2 (33.3) 2 (25.0) 0.997 Stage (%) 1.000 None 3 (42.9) 2 (40.0) 2 (28.6) 3 (33.3) 2 (33.3) 3 (37.5) 2 (33.3) 2 (28.6) 3 (42.9) 2 (40.0) 3 (37.5) 2 (40.0) 3 (37.5) 3 (33.3) 3 (37.5) 2 (28.6) 2 (33.3) 3 (42.9) 2 (33.3) 3 (37.5) Stages I and II 1 (14.3) 1 (20.0) 1 (14.3) 1 (11.1) 1 (16.7) 1 (12.5) 1 (16.7) 1 (14.3) 1 (14.3) 1 (20.0) 1 (12.5) 1 (20.0) 1 (12.5) 1 (11.1) 1 (12.5) 1 (14.3) 1 (16.7) 1 (14.3) 1 (16.7) 2 (25.0) Stages III and IV 3 (42.9) 2 (40.0) 4 (57.1) 5 (55.6) 3 (50.0) 4 (50.0) 3 (50.0) 4 (57.1) 3 (42.9) 2 (40.0) 4 (50.0) 2 (40.0) 4 (50.0) 5 (55.6) 4 (50.0) 4 (57.1) 3 (50.0) 3 (42.9) 3 (50.0) 3 (37.5) Batch only has individuals unique to this batch true true true true true true true true true true true true true true true true true true true true N/A Samples n 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 N/A Endo = yes (%) 13 (81.2) 14 (87.5) 14 (87.5) 13 (81.2) 14 (87.5) 13 (81.2) 14 (87.5) 14 (87.5) 13 (81.2) 14 (87.5) 13 (81.2) 14 (87.5) 13 (81.2) 13 (81.2) 13 (81.2) 14 (87.5) 14 (87.5) 13 (81.2) 14 (87.5) 13 (81.2) 1.000 Site = UC (%) 3 (18.8) 2 (12.5) 3 (18.8) 3 (18.8) 2 (12.5) 4 (25.0) 3 (18.8) 4 (25.0) 3 (18.8) 2 (12.5) 3 (18.8) 4 (25.0) 3 (18.8) 4 (25.0) 4 (25.0) 3 (18.8) 4 (25.0) 5 (31.2) 4 (25.0) 3 (18.8) 1.000 Phase = SE (%) 6 (37.5) 7 (43.8) 7 (43.8) 6 (37.5) 7 (43.8) 6 (37.5) 5 (31.2) 7 (43.8) 7 (43.8) 6 (37.5) 6 (37.5) 8 (50.0) 7 (43.8) 6 (37.5) 7 (43.8) 6 (37.5) 6 (37.5) 6 (37.5) 6 (37.5) 7 (43.8) 1.000 Stage (%) 0.991 None 3 (18.8) 2 (12.5) 2 (12.5) 3 (18.8) 2 (12.5) 3 (18.8) 2 (12.5) 2 (12.5) 3 (18.8) 2 (12.5) 3 (18.8) 2 (12.5) 3 (18.8) 3 (18.8) 3 (18.8) 2 (12.5) 2 (12.5) 3 (18.8) 2 (12.5) 3 (18.8) Stages I and II 2 (12.5) 2 (12.5) 5 (31.2) 2 (12.5) 4 (25.0) 2 (12.5) 5 (31.2) 4 (25.0) 3 (18.8) 8 (50.0) 2 (12.5) 4 (25.0) 2 (12.5) 2 (12.5) 2 (12.5) 5 (31.2) 5 (31.2) 2 (12.5) 5 (31.2) 4 (25.0) Stages III and IV 11 (68.8) 12 (75.0) 9 (56.2) 11 (68.8) 10 (62.5) 11 (68.8) 9 (56.2) 10 (62.5) 10 (62.5) 6 (37.5) 11 (68.8) 10 (62.5) 11 (68.8) 11 (68.8) 11 (68.8) 9 (56.2) 9 (56.2) 11 (68.8) 9 (56.2) 9 (56.2)
Summary statistics on an individual- and sample-level for the batches assigned using the unrestricted anticlustering method, with p values from χ 2 tests
Summary statistics on an individual- and sample-level for the batches assigned using the OSAT method, with p values from χ 2 tests
Summary statistics on an individual- and sample-level for the batches assigned using the anticlustering method with the must-link feature, with p values from χ 2 tests
An interactive visualization of batch assignment for our synthetic dataset with OSAT and anticlust, the balance on a sample level achieved by these methods, as well as a web-based batch assignment tool, is made available in the Rshiny app “anticlust,” https://anticlust.org/ .
Resource
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Tomiko Oskotsky (
[email protected] ).
This study did not generate new, unique reagents.
• All data generated for this study have been deposited in the Open Science Repository and are available via https://doi.org/10.17605/OSF.IO/EU5GD . • All original code has been deposited in the Open Science Repository and is publicly available via https://doi.org/10.17605/OSF.IO/EU5GD . • Any additional information required to reanalyze the data reported in this article is available from the lead contact upon request.
All data generated for this study have been deposited in the Open Science Repository and are available via https://doi.org/10.17605/OSF.IO/EU5GD .
All original code has been deposited in the Open Science Repository and is publicly available via https://doi.org/10.17605/OSF.IO/EU5GD .
Any additional information required to reanalyze the data reported in this article is available from the lead contact upon request.
Discussion
High-throughput molecular profiling technologies commonly require that samples be processed in batches, while confounding effects among batches should be minimized as much as possible. 8 , 9 Though post-hoc corrections during statistical analysis are possible to compensate for imbalances in covariates, exercising control on an a priori level—i.e., creating batches that are as similar as possible on relevant covariates—is also crucial. 7 Previous methods for batch assignment include OSAT, which can be used to minimize discrepancy on categorical covariates, and PSBA, which minimizes discrepancy among batches with regard to average PSs. We introduced anticlustering as a robust and improved alternative for assigning samples to batches, enabling similarity across categorical and numerical covariates. While anticlustering has already been used in a variety of settings, 14 , 15 , 16 , 17 , 18 , 19 , 20 it has not yet been put to use in the context of batch assignment, which poses additional challenges as the integration of must-link constraints.
We presented a motivating example to illustrate how anticlustering can be used to assign samples to batches effectively. The example reproduced a real-life application at the UCSF-Stanford ENACT. In our application, we strove for similarity among batches regarding (1) case vs. control samples, (2) clinical site, (3) menstrual cycle phase, and (4) disease stage. Anticlustering proved to be an effective tool in order to meet these criteria. We also required that all samples provided by the same person be processed in the same batch to facilitate the comparison of all of an individual’s samples and avoid confounding the analysis with batch effects. For our work, having the ability to assign each of the individual’s samples intentionally to the same batch would allow greater confidence in findings from our comparison of the individual’s samples, as concerns about batch effects would be mitigated. A similar situation is conceivable in the context of assigning students to classes: groups of friends have to be assigned to the same class, while striving for overall similarity among—and heterogeneity within—classes. These must-link requirements could not be fulfilled using existing approaches such as OSAT and PSBA, and therefore, we developed an innovative method (2PML) that combined the theory of constrained cluster analysis 10 with the domain of anticlustering.
Our evaluation revealed that anticlustering provides several critical advantages over alternative approaches. First, the simulation study indicated that in the context of achieving balance in batch assignment tasks, anticlustering outperforms OSAT and PSBA quite considerably ( Figure 2 ). The advantages of anticlustering persisted even when accounting for potential sample dropouts, which had a less pronounced effect on the balance among batches assigned by anticlustering compared to those assigned by OSAT or PSBA. Note that the superior performance of anticlustering was not due to a different tradeoff of speed and quality: The improved anticlustering results were obtained much faster than the results of OSAT and PSBA. Moreover, compared to OSAT and PSBA, anticlustering can be applied more flexibly in a broader range of settings. For example, OSAT’s usage is restricted to using categorical variables as input. Anticlustering can deal with numerical as well as categorical variables, or even a mixture of these types of variables. Categorical variables can (1) be included as numeric variables via binary coding, as we implemented in the current simulation study and application, or (2) as a “hard constraint” that strictly prioritizes balance according to one categorical variable over the other variables. 11 PSBA—at least given the current implementation provided by the authors—is restricted to creating a maximum of four batches. Anticlustering has been implemented to handle an arbitrary number of batches, as well as user-defined batch sizes. The anticlust package also provides additional functionality not discussed here, including the possibility to choose among a variety of anticlustering objective functions, different optimization algorithms (including optimal methods), and a specialized algorithm for very large datasets.
The clear superiority of anticlustering over the alternatives OSAT and PSBA, as indicated by our simulation study, might be met with skepticism. 30 In a strict sense, the results only pertain to the conditions realized in the simulation (e.g., regarding the number of samples or the number and type of variables). Still, we argue that the results reflect a true difference in potency among approaches. In particular, anticlust uses a better optimization algorithm than either OSAT or PSBA. The anticlust package by default relies on a variant of the Lotfi-Cerveny-Weitz (LCW) method. 22 LCW employs a systematic improvement search by exchanging each sample with the most promising candidate currently a part of a different batch. Anticlust even provides more sophisticated optimization algorithms that have been developed recently, such as the bicriterion iterated local search heuristic by Brusco et al., 12 and the three-phase search approach by Yang et al. 31 However, to obtain the most fair comparison in the simulation study, we applied the default method without additional tweaks. Before explaining the differences in the methodological performance among anticlustering, OSAT, and PSBA in more detail, we would like to clarify an important distinction regarding batch assignment methods. Such methods are generally characterized by two components: (1) an objective function that quantifies balance among batches; (2) an optimization algorithm that assigns samples to batches in such a way that balance is maximized. Among anticlustering, OSAT, and PSBA, both components differ. Hence, observed differences in the simulation study cannot unambiguously be attributed to either component. In particular, the decreased performance of OSAT and PSBA does not imply that both components are inferior. Importantly, we believe that the objective functions employed by OSAT and PSBA have merit, but that both methods leave room for improvement regarding their optimization algorithms. First, the PSBA implementation provided by its developers uses a random sampling algorithm. That is, PSBA generates a fixed number of batch assignments at random—possibly while implementing a stratification on one of the input variables—and then selects the one assignment that minimizes discrepancy in PSs among samples. While we believe that discrepancy in PSs is a useful criterion, mere random sampling is generally considered a poor optimization algorithm. 32 OSAT uses a more advanced optimization algorithm than PSBA, applying pairwise exchanges between samples like LCW. However, the exchange method used by OSAT is less sophisticated. In particular, it does not conduct a systematic search that selects the best possible exchange for each sample, but instead randomly attempts a (fixed) number of exchanges.
We introduced anticlustering as an effective method for batch assignment to mitigate batch effects and ensure the robustness of analyses of data from next-generation sequencing and other high-throughput molecular profiling technologies. Additionally, we implemented a custom feature for anticlustering, which allows researchers to specify pairs or groups of samples that must be assigned to the same batch. Anticlustering, including its must-link feature, is freely available via the open-source R package anticlust ( https://CRAN.R-project.org/package=anticlust ) and can be interactively visualized and utilized through the R Shiny app, https://anticlust.org . Via the accompanying package website ( https://github.com/m-Py/anticlust ), additional documentation and community support are available.
While our evaluation highlights the power of anticlustering for batch assignment, some limitations of the approach should be considered. Anticlustering effectively negates differences in observed covariates among batches, but its effectiveness depends on the input data. The relevance of covariates that anticlustering uses for assignment is for the researcher to determine. If important covariates have not been measured, they cannot be balanced. Moreover, anticlustering cannot overcome problems associated with the data themselves. Imprecise measurement and missing data are problems that affect anticlustering just as with any other method of data analysis. The degree to which anticlustering ensures balance also depends on the data. Increasing the number of samples facilitates finding an acceptable balance; using additional covariates or smaller group sizes reduces the balance among batches. 23 Given that these parameters usually cannot be chosen freely, researchers should always carefully inspect the observed balance after applying anticlustering. Another limitation of the anticlustering approach is that it can only be used if all samples are available before batch allocation; anticlustering requires that all covariates have been measured at the start. It cannot be used to fill batches sequentially as additional samples become available. 33 The anticlustering method was neither intended nor equipped to perform batch effect detection or quantification of existing batch assignments. In particular, it cannot inform researchers on the degree of confounding between variables of interest and batches. It can, however, be used to prevent confounding in the first place. Post hoc assessment of summary statistics for key potential covariates can provide insight into how well variables are balanced across batches (for example, provided by the tableone package 29 as used in our example application). Nevertheless, even with the use of anticlust and favorable summary statistics, applying downstream correction methods (e.g., sva and ComBat) remains essential, as unmeasured or latent variables may still introduce batch-related biases that can best be addressed using statistical adjustment for the batch variable. Experimental failures are a common challenge in real-world studies, and significant post-assignment dropout may compromise the intended balance across batches. Beginning with the best possible balance, and identifying and grouping samples based on the least granular categories of key variables may help reduce the impact of such dropout by preserving balance at a broader level.
Declaration
During the preparation of this work, the authors used ChatGPT-4o (released on May 13, 2024, by OpenAI) in order to refine language and enhance readability. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Introduction
With advances in high-throughput technologies in recent decades, increasing amounts of data have become available for research, including genomics and epigenomics, bulk, single-cell, and single-nucleus transcriptomics, proteomics, and metabolomics. High-throughput sequencing (i.e., “next generation sequencing”), for example, is a powerful method for sequencing large amounts of DNA and RNA that is faster and more cost-efficient than traditional sequencing techniques (e.g., Sanger method) and has contributed to an acceleration of the rate of scientific discovery. Unless the number of samples is sufficiently small to fit in a single batch, samples that undergo high-throughput analyses are generally processed in multiple batches. However, systematic technical factors (“batch effects”) can arise when samples are processed in different batches. 1 , 2 , 3 Proper experimental design and statistical methods are required to mitigate batch effects as data variation due to these non-biological factors can mask the actual biological differences. 1 , 2 , 3 , 4 Unless batch effects are appropriately addressed, conclusions from downstream analyses can be misleading or erroneous. 1 , 2 , 3
Computational packages, including sva 5 and ComBat, 6 can perform batch correction; however, they cannot correct issues with batch design. For instance, if all cases are in one batch and all controls in another, distinguishing biological from technical effects remains difficult. Hence, careful experimental design, such as balancing classes and avoiding uneven batch sizes, is essential. 1 , 2 , 3 , 7 Resources such as Optimal Sample Assignment Tool (OSAT) 8 and, more recently, a propensity score (PS)-based method 9 have been proposed to facilitate the allocation of samples to batches to minimize the impact of batch effects. While these approaches have been applied to facilitate real-life batch allocation problems, they come with limitations that reduce their applicability. For example, OSAT is limited to handling categorical variables, while numeric variables are not supported (or must be categorized artificially). The PS-based approach has so far only been implemented to handle up to four batches, while real-life applications in high-throughput sequencing often involve more batches. Moreover, neither approach implements additional constraints that can be crucial for obtaining valid inferences in experiments. Just as teachers may group students into balanced teams while keeping some together, researchers aim to assign samples to balanced batches while keeping certain samples (e.g., from the same individual) in the same batch. Sometimes, it is necessary to balance key features across batches while ensuring specific samples remain grouped to avoid batch effects. For example, if researchers have multiple samples per patient and wish to compare both within and across patients, all samples from a given patient should ideally be assigned to a single batch, while maintaining the overall batch balance. In the context of cluster analysis, these kinds of requirements have been termed as must-link constraints, 10 and implementing them can be beneficial in batch assignment.
In this paper, we propose using anticlustering as an automated and improved process to assign samples to balanced batches. Anticlustering is a method that divides a set of elements into disjunct groups, with the aim of maximizing similarity among the different groups. 11 , 12 , 13 Papenberg and Klau 11 presented the free, open-source R package anticlust that implements a variety of algorithms for anticlustering. Since its introduction, it has been used for diverse tasks in a wide range of research fields. 14 , 15 , 16 , 17 , 18 , 19 , 20 To the best of our knowledge, however, anticlustering has not been previously applied for batch assignment in high-throughput sequencing or other molecular profiling technologies. To close this gap, we provide readers with an introduction to anticlustering, putting a practical focus on batch assignment in high-throughput sequencing. Moreover, we introduce an innovative method and implementation to include the must-link constraints with anticlustering.
To evaluate the usefulness of anticlustering for batch assignment, we conducted an extensive simulation study that compared the anticlust package 11 to the existing OSAT package and the PS-based method. We demonstrated that anticlust’s ability to assign samples to balanced batches is better than that of these existing methods, and anticlust’s must-link constraints did not decrease batch balance considerably. Anticlust batch-assignment and must-link features have already been implemented by the University of California, San Francisco (UCSF)-Stanford Endometriosis Center for Discovery, Innovation, Training and Community Engagement (“ENACT,” https://enactcenter.org/ ) for the center’s projects leveraging single-nuclei transcriptomics and other molecular profiling technologies to elucidate mechanisms involved in endometriosis. 21 Readers can find documentation, code examples, and all data needed to reproduce our analyses at https://osf.io/eu5gd/ . All methods are available in the open-source R package anticlust ( https://cran.r-project.org/package=anticlust ).
Coi Statement
The authors declare no competing interests.
Star★Methods
REAGENT or RESOURCE SOURCE IDENTIFIER Software and algorithms Anticlust Paperberg and Klau 11 https://doi.org/10.32614/CRAN.package.anticlust https://github.com/m-Py/anticlust must-link anticlustering This paper https://doi.org/10.17605/OSF.IO/EU5GD OSAT Yan et al. 8 https://doi.org/10.18129/B9.bioc.OSAT http://bioconductor.org/packages/OSAT/ PSBA Carry et al. 9 https://doi.org/10.1186/s12859-023-05202-6 https://github.com/carryp/PS-Batch-Effect tableone Yoshida et al. 29 https://doi.org/10.5281/zenodo.814880 https://cran.r-project.org/web/packages/tableone/index.html Deposited data must-link anticlustering: synthetic dataset Papenberg et al. 23 https://doi.org/10.17605/OSF.IO/EU5GD
Anticlustering is an optimization method that is characterized by (a) an objective function that quantifies the balance among batches, and (b) an algorithm that conducts the batch assignment in such a way that balance among batches is maximized. Anticlustering owes its name to the fact that the objective functions it uses are the reversal of criteria used in classical cluster analysis. For example, Späth 13 already recognized that by maximizing instead of minimizing the k-means criterion (the “variance”), he was able to create groups that are similar to each other, and presented it as an improvement over the more intuitive random assignment. 33 Brusco et al. 12 recognized that other objective functions known from cluster analysis can also be implemented in the context of anticlustering.
In our application, we optimized the diversity objective to maximize similarity among batches. While diversity is technically a measure of within-batch heterogeneity, its maximization simultaneously leads to minimal difference between the distribution of the input variables among batches (at least when batches are equal-sized). 23 , 34 Papenberg and Klau 11 referred to the maximization of the diversity as “anticluster editing” because the minimization of the diversity is also well-known from the area of cluster analysis—under the term “cluster editing.” 35 , 36 The diversity is computed on the basis of a measure of pairwise dissimilarity among samples. In particular, it is defined as the overall sum of all dissimilarities among samples that are assigned to the same batch. 12 Hence, the diversity is not directly computed on the basis of sample features, but instead it relies on a distance measure that is computed on the basis of these features, for each pair of samples. In the context of anticlustering, the Euclidean distance is the most common measure that translates features to pairwise dissimilarities. 11 , 37 The Euclidean distance is defined as d ( x , y ) = ∑ i = 1 M ( x i − y i ) 2 , where M is the number of numeric features describing two samples x = (x 1 , …, x M ) and y = (y 1 , …, y M ) . When samples are described by two features, the Euclidean distance corresponds to the geometric, “straight line” distance between two points in a two-dimensional space; more similar data points are closer to each other (see Figure 1 ). For categorical variables, we use one-hot binary encoding 38 before including them in the computation; in our example application, we used the squared Euclidean distance on binary encoded features as input for anticlustering.
An anticlustering algorithm assigns samples to batches in such a way that the objective function—here, the diversity—is maximized. Anticlustering usually employs heuristic optimization algorithms. 31 While heuristics generally provide satisfying results in the context of anticlustering, 11 they do not guarantee finding the globally best assignment among all possibilities. In principle, enumerating all possible assignments is a valid strategy to obtain an optimal assignment. However, this approach quickly becomes impossible due to an exponential growth of the way in which assignments can be conducted. 11 Moreover, because anticlustering problems (with the exception of some special cases) are also NP-hard, 34 there is likely no algorithm that identifies the globally best assignment without considering all possibilities (at least in the worst case). In practice, heuristics are therefore indispensable.
In the anticlust package, a heuristic exchange algorithm is by default used to maximize the diversity. 11 , 13 , 22 It consists of two steps: an initialization step and an optimization step. As initialization, it randomly assigns samples to batches of equal size or user-defined sizes. After initialization, the algorithm selects a sample and checks how the diversity would change if the sample were swapped with each sample that is currently assigned to a different batch. After simulating each exchange, it realizes the one exchange that increases the diversity the most. It does not conduct an exchange if no improvement in diversity is possible. This procedure is repeated for each sample and it terminates after the last sample is processed. The procedure might also restart at the first element and reiterate through all samples until no pairwise exchange can lead to any further improvement, i.e., until a local maximum is found. In anticlust, we also implemented this local maximum search, which corresponds to the algorithm LCW. 22 For better results, it is also possible to restart the search algorithm multiple times using different (random) initializations. 13
Previously, two other approaches have been proposed to assign samples to batches with the aim of maximizing balance among batches: the “Optimal Sample Assignment Tool” (OSAT) 8 and propensity score based method, which we refer to as “Propensity Score Batch Assignment” (PSBA). 9 Both methods differ with regard to the objective function used to quantify batch balance as well as the optimization algorithm that assigns samples to batches. OSAT’s objective function compares the expected distribution of a categorical variable to the observed distribution, which is highly similar to the well-known χ 2 statistic. Higher values indicate that the variables are not uniformly distributed among batches. Notably, the OSAT objective function per se is only defined for one categorical variable. If multiple variables are present, all variables are therefore “merged” into a single variable. For example, two variables gender (female, male) and smoker (yes, no) would be recoded into one variable with four levels, (1) smoking female, (2) non-smoking female, (3) smoking male, (4) non-smoking male. If the variables are not perfectly balanced, merging removes some information on the distribution of the original variables, partially explaining the decreased performance of OSAT in our simulation study (see Figure 2 ). To optimize the objective function, OSAT first uses a stratified random assignment of samples to batches and then performs a fixed number of reshuffles to optimize the assignment. Each reshuffles randomly selects k samples (where k = 2 is the default) and swaps them between batches. Reshuffles that improve the objective are retained, while reshuffles that do not improve the objective are discarded.
The PSBA objective function is based on computing propensity scores for all samples; using individual propensity scores, the method attempts to minimize the discrepancy between average propensity scores among batches. Propensity scores are computed using a logistic regression model and are defined as the probability that a sample is allocated to a batch given the covariates, versus it being assigned to a different batch. Hence, minimizing discrepancy in average propensity scores simultaneously balances the distribution of covariates among batches. For their own application, Carry et al. were able to investigate all possible sample-to-batch assignments and chose the globally best assignment. However, this approach is only feasible for very small datasets. Therefore, they also implemented a random sampling approach that generates a fixed number of random assignments and chooses the best among them. It is possible to define a stratification variable such that balance according to this (categorical) variable is prioritized during sampling.
In our application, samples belonging to the same patient were required to be assigned to the same batch. We refer to a set of samples that must be linked together within the same batch as a must-link clique . We use the term singleton to refer to samples that are not part of a clique. Basically, anticlustering with the must-link feature uses a modified representation of the original dataset in which singletons and cliques are used as input elements—as opposed to unconstrained anticlustering, where each individual sample constitutes an input element. To implement the must-link requirements while striving for similarity among batches, we developed the Two-Phase Must-Link (2PML) algorithm for anticlustering.
The first phase of 2PML is an adaptation of the local maximum search algorithm LCW, which however ensures that must-link constraints are respected. Some adjustments of the algorithm are required to ensure (a) we obtain a valid partitioning regarding the cardinality constraints (e.g., equal-sized batches) and (b) the diversity is computed correctly during optimization. We therefore had to adjust both the initialization phase as well as the optimization phase of the algorithm. During initialization, we first assign all cliques to batches. Each clique must be assigned completely to one of the batches and samples within a clique must not be split apart. At the same time, the maximum capacity of each batch must not be exceeded. Using this conceptualization, the initialization step corresponds to a bin packing problem, which is one of the classical NP-complete problems in computer science. 39 That is, we assign a weight to each clique, corresponding to the number of samples it contains. When filling batches, the sum of the weights of the cliques in each batch must not exceed its capacity. Many exact and heuristic bin packing algorithms have been developed. As the default method, we use a randomized first fit heuristic to fill batches: For each clique, we iterate through all batches in random order and assign it to the first batch where it fits. The process is expected to evenly distribute the must-link cliques among batches. This random component is particularly useful if we apply multiple restarts of the optimization algorithm. After assigning cliques to batches, the singleton samples can be assigned arbitrarily to fill the remaining space. Note that our randomized first fit algorithm is a heuristic that may not find an assignment of must-link groups to batches even if one is theoretically available. If the heuristic indicates that the batches cannot hold the must-link cliques, we therefore use an exact algorithm based on integer linear programming (ILP) as a fallback option, which allows us to verify if the constraints really cannot be satisfied. To this end, we implemented an adaptation of the standard bin packing ILP model by Martello and Toth 40 : (Equation 1) minimize ∑ 1 ≤ i ≤ n ∑ 1 ≤ j ≤ k x i j , (Equation 2) subject to ∑ i = 1 n w i x i j ≤ c j j = ( 1 , … . , K ) , (Equation 3) ∑ j = 1 K x i j = 1 i = ( 1 , … . , n ) , (Equation 4) x i j ∈ { 0 , 1 } i = ( 1 , … . , n ) , j = ( 1 , … . , K ) .
The number of cliques is given by n . The model has binary decision variables x ij to encode whether clique i ( i = 1, …, n ) is assigned to batch j ( j = 1, …, K ). The model uses K values c j to represent the capacity of each batch. The model has n values w i to encode the weight of each clique, i.e., the number of samples it represents. Constraint (2) ensures that the weight of each batch is not exceeded; constraint (3) ensures that each clique is assigned to exactly one batch. Note that during the initialization step, we only need to test if the constraints (2) and (3) can be satisfied; in the context of must-link anticlustering, any feasible assignment is equally valid. For this reason, the objective function (1) is chosen to be constant for each assignment that satisfies the constraints. It does not actually contribute to solving the problem, and the model therefore only tests if the must-link constraints can be satisfied. In anticlust, we call one of the programs gurobi, 41 Symphony, 42 , 43 GLPK, 44 or lpSolve 45 to yield an optimal assignment for the bin packing problem as initialization for the must-linked samples.
For the optimization phase of the exchange algorithm, we generate a modified representation of the original distance matrix that preserves all relevant information. To this end, we compute new pairwise distances between (a) each clique and the other cliques and between (b) each clique and the singleton samples. The new distances are given as the sum of all pairwise distances between (a) the samples in two different cliques, or (b) between all samples of a clique and a singleton. In the context of maximizing the diversity, this transformation sufficiently preserves the relevant information in the original distance matrix. 36 Using the initial assignment and the transformed distance matrix, we apply the same exchange algorithm as we described for the unrestricted anticlustering problem. However, during the exchange process, we only permit exchanges between cliques of the same cardinality (e.g., patients providing the same number of samples) to ensure that the cardinality constraints are respected throughout. As opposed to unconstrained anticlustering, this restriction limits the number of exchanges that are evaluated. Hence, we developed Phase 2 to overcome the limitations of Phase 1.
The second phase of 2PML is an improvement phase that drops the restriction of performing exchanges among must-link cliques of the same cardinality. It is initialized using the assignment obtained in Phase 1. It iterates through all must-link cliques and for each clique, selects a combination of samples for exchange. The combination of samples is chosen in such a way that (a) must-link constraints are preserved, (b) all samples are currently part of the same batch, and (c) the total number of samples is equal to the size of the current clique. Among all combinations of samples that fit these requirements, one combination is randomly chosen. If no feasible combination is found, the next clique is evaluated. If a combination is found that fits the requirements (a)-(c), the effect of exchanging it with the current clique is evaluated. Exchanges that do not lead to an improvement in diversity are discarded; exchanges that lead to an improvement in diversity are retained. For example, for a clique of size 4, we might select four singletons for exchange; a combination of two cliques of size 2; a clique of size 3 and a singleton; or two singletons and a clique of size 2. Note that generating all possible combinations for exchange corresponds to the subset sum problem, which is another classical NP-hard problem in computer science. 39 Due to its computational complexity, we cannot exhaustively generate all exchange combinations for large clique sizes. We exhaustively generate all combinations for cliques of up to 10 samples; for larger cliques, we use a heuristic to generate a reduced set of feasible exchange combinations: We generate a set of equal-sized cliques, possibly filled with singletons to ensure the number of samples is sufficient. For example, if this rule were applied to cliques of size 7 (which it is not, because we generate all possible combinations for cliques of size 7), we would obtain the combinations (1, 1, 1, 1, 1, 1, 1), (2, 2, 2, 1) and (3, 3, 1) for exchange.
After iterating through all cliques, the improved batch assignment is once again used as input for the optimization step of Phase 1. That is because—maybe counterintuitively—while the new assignment is likely improved over the initial assignment that was obtained in Phase 1, it may no longer be locally optimal. It can therefore be restored to local optimality for additional improvement. The idea of restoring local optimality of a perturbed assignment is inspired by the iterated local search by Brusco et al. 12
Both phases of 2PML can be called repeatedly as part of a multiple restart algorithm. In anticlust, when users request multiple restarts, half of the restarts will perform Phase 1 and the other half will perform Phase 2. The best assignment across all restarts in Phase 1 is used as input for the first iteration of Phase 2. The second iteration of Phase 2 will again perform an improvement phase, this time using the output of the first iteration of Phase 2 as input; the third iteration then uses the output of the second iteration, and so forth. The process is repeated until all user-defined restarts are conducted.
Due to the computational complexity of anticlustering, batch assignment problems are usually tackled using heuristic algorithms. Still, for some problem constellations—in particular when n is not large—it is possible to employ exact algorithms that find the globally best batch assignment. Papenberg and Klau 11 presented an ILP model to find globally optimal batch assignments for the diversity objective. It can be used to solve problem instances of up to about n = 30 in acceptable running time; Schulz 46 used a time limit of 1800 s and showed that up to 30 samples can be assigned optimally in this time. In the current paper, we extend the model by Papenberg and Klau 11 to allow it to include must-link constraints. The extension is actually quite straightforward: To induce must-link constraints in the context of an exact algorithm, it is sufficient to adjust the distance matrix used as input. Whenever the pairwise distance between two samples is set to ∞—and the set of must-link constraints can be satisfied—any globally optimal assignment will place these samples in the same batch, because the objective value associated with such an assignment is necessarily better than that of an assignment that places them in different batches. This “trick” of adjusting the data input to induce must-link constraints has long been used in the context of exact algorithms for cluster editing. 36 Including must-link constraints in anticlustering increases the number of samples that can be processed, because it reduces the effective number of samples. In our online supplementary materials, we show that up to about 40 samples can be processed using a time limit of 1800 s on a desktop computer (see Figure S2 ); beyond that, the running time is expected to increase exponentially with an increasing number of samples.
Our simulation study compared anticlustering, must-link constrained anticlustering, OSAT and PSBA regarding their ability to achieve balance between batches. All code to reproduce the simulation and its analysis are available from the accompanying Open Science Repository ( https://doi.org/10.17605/OSF.IO/EU5GD ). The simulation performed 10,000 iterations. During each iteration, a synthetic dataset was randomly generated that was processed by each of the four methods. We generated categorical variables having 2–5 levels. Each dataset contained 2–5 of such categorical variables. The total sample size n (50–500), the number of variables (2–5), the number of categories per variable (2–5), and the number of batches (2, 4, 10) was randomly chosen from a uniform distribution for each iteration of the simulation.
To quantify balance among batches, a standard χ 2 test was performed for each variable in each dataset. The χ 2 test quantifies the discrepancy between the observed distribution of a categorical variable across batches, versus an assumed uniform distribution. The p -value associated with the χ 2 test quantifies the “extremeness” of the observed distribution. Higher p -values indicate higher balance among batches. With entirely random assignment of samples to batches, all p -values would be equally likely. Small p -values imply a small probability that the observed distribution (or a more extreme distribution) of the variable would be generated under true uniform assignment. In an inferential statistics framework, one therefore discards the null hypothesis of true random sampling if the p -value is below a certain threshold (e.g., 0.05), which is well-known among applied researchers. Probably less well-known is that consistently observing very high p -values is also inconsistent with true uniform sampling, even though these high p -values indicate that the observed distribution of a categorical variable is indeed uniform among samples. This is because—maybe counterintuitively—a high p -value (e.g., close to 1) indicates that even when sampling from a true uniform distribution, we would usually expect an observed distribution that deviates more strongly from a uniform distribution. With non-random assignment (e.g., anticlustering), p values tend to be rather high and oftentimes approach 1 (see Figure 2 ). With random uniform sampling, the expected value of the p value would be 0.50.
In our simulation, p -values were obtained for each of the 2–5 variables in each dataset, for each of the four methods and for each of two dropout conditions (no dropout vs. 20% dropout). In each simulation run, we computed at least 16 p values and a maximum of 40 p values for each iteration of the simulation. Thus, the effective n in our simulation was not 10,000, but depended on the number of variables that were generated per simulation run. As a main comparison, we reported the relative proportion of variables where anticlustering outperformed the competing methods, was equal to them, or was inferior. The effective n for this comparison was 34,880 pairs of p values vs. OSAT and 23,258 pairs of p values vs. PSBA. The n was different for each set of comparisons because PSBA could not be applied when 10 batches were assigned. For this main comparison, we only used data from the no dropout condition. As secondary analysis, we generated Figure 2 to depict the average p value by assignment method, number of batches, number of variables, and dropout condition, while aggregating across the other variables and across all simulation runs.
We did not report indices of statistical significance pertaining to the comparison between assignment methods. This was due to (a) the large sample size, which in conjunction with the large observed differences between methods rendered measures of significance largely meaningless, and (b) the ease of reproducing the simulation. Similarly, we chose a fixed but large number of simulation iterations, instead of aiming for a specified level of statistical power. A large data basis was generated allowing for meaningful conclusions, which is common practice in simulation studies where generating additional data is comparably easy. The number of iterations we employed (10,000) is a popular choice in computational studies and was also practically feasible in our particular study: using 10000 iterations, the simulation took a total time of about 3 days on a desktop computer. Note that for a revision of this article, we repeated the entire simulation (using a different random seed and adding the 20% dropout condition) and virtually obtained the same pattern of results. Our accompanying online repository contains the initial dataset as well as the revised (and final) dataset, verifying the high level of reproducibility.
Acknowledgments
This study was funded, in part, by the 10.13039/100000002 National Institutes of Health , specifically the Eunice Kennedy Shriver 10.13039/100000071 National Institute of Child Health and Human Development , grant number P01 HD106414 . The funder played no role in the study design, data collection, analysis, interpretation of data, or the writing of this manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors sincerely thank Martin Breuer, Ana Maria Deluca, Jessica Gelfand, Hannah Hengelbrock, Grace Loll, Nils Lüschow, Edna Rodas, and Vivian Siu for their valuable contributions to this work.
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.