Heavy-tail-aware representation learning and dynamic Bayesian state modelling to derive an operational proxy definition of problem gambling risk from routine online gambling data

doi:10.21203/rs.3.rs-8889984/v1


2026 · doi:10.21203/rs.3.rs-8889984/v1

preprint OA: closed CC-BY-4.0


Research Article

Sam Andersson, Helga Westerlind, Timo Koski, Keenan Lyon, Per Carlbring, and 2 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8889984/v1. This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1.

Abstract

Background: Problem gambling causes harm, but operational identification often relies on heuristic thresholds or sparse manual reviews. Routine online gambling logs are heavy-tailed and temporally structured, complicating risk definition and early detection.
Methods: We analysed de-identified records from an online gambling operator across four streams (transactions, bets, sessions, payments). Time series were summarised into leakage-audited 30-day windows with heavy-tail-aware exceedance frequency and magnitude features. Window embeddings were learned using a hierarchical conditional variational autoencoder: a teacher trained on responsible-gambling proxy signals, then a student fine-tuned on sparse manual analyst assessments on the training split only. To address missing-not-at-random assessments, backlog-aware label inference conservatively augmented training data. Dynamic regimes were inferred from embeddings using a regularised Gaussian hidden Markov model, yielding a three-class operational proxy definition. Agreement with analyst assessments and early-warning utility under explicit capacity constraints were evaluated on held-out labels.

Results: Balanced accuracy on labelled test windows ranged from 0.38 (transactions) to 0.62 (payments), with best macro-averaged F1-score in bets (0.54). Under a capacity-constrained top-10-per-week queue, escalation detection ranged from 0.39 (sessions) to 0.62 (bets), with median lead times of 42–290 days.

Conclusions: Heavy-tail-aware representations combined with dynamic regime modelling can derive an auditable operational proxy definition of gambling-related risk from routine data and support realistic, capacity-constrained monitoring.

Keywords: problem gambling, operational risk definition, heavy-tailed behavior, representation learning, variational autoencoder, hidden Markov model, early warning, capacity-constrained triage, missing-not-at-random labels

Introduction

Digital platforms generate dense “digital traces” of human behavior that can be analyzed as first-order objects for understanding and monitoring socio-technical systems (1).
In many operational settings, the core scientific and engineering challenge is not simply prediction, but definition: how to infer a latent, time-varying risk state from heavy-tailed, temporally dependent behavioral data when expert labels are sparse and selectively observed, and when downstream actions are constrained by finite review capacity ( 2 ). Online gambling provides a clear instance of this broader problem. Behavioral logs capture high-frequency patterns of engagement and spending, but activity is dominated by rare extreme days rather than typical averages, and meaningful signals may appear as bursts, clustering, or persistent shifts rather than smooth trends ( 3 , 4 , 5 ). Gambling-related harm is associated with substantial impacts on health, well-being, and socioeconomic outcomes, and a persistent operational challenge is identifying elevated risk early enough to prevent escalation ( 6 ). Many responsible gambling (RG) programs rely on self-report screening tools, voluntary limit-setting, and rule-based triggers such as deposits or losses above fixed thresholds ( 7 , 8 ). These approaches can be delayed (flagging risk after large losses), brittle to behavioral adaptation, and poorly suited to heavy-tailed behavioral distributions where a small number of days account for a large fraction of wagering or losses ( 3 , 5 ). For instance, a customer may exceed a monthly deposit threshold only after incurring substantial losses over many extreme-wagering days, by which time harm may already have occurred ( 3 , 5 ). The rapid digitalization of gambling products has increased “always-on” access and the volume of routine telemetry available for monitoring, strengthening calls for scalable evaluation and oversight infrastructure built on routinely collected data ( 6 , 7 ). 
Survey-based definitions and psychometric instruments remain essential for screening and prevalence estimation, but they were not designed for near-real-time operational monitoring on platform data ( 9 – 11 ). This motivates complementary operational definitions that treat risk as time-varying, can be evaluated prospectively against outcomes and service responses, and explicitly incorporate resource constraints that determine what monitoring systems can realistically do ( 6 ). In this context, an “operational proxy definition” is a model-derived categorization intended to support consistent triage and evaluation, not to assert a clinical diagnosis. Behavioral and machine-learning approaches using account-based tracking data are increasingly common ( 8 , 12 , 13 ). However, online gambling telemetry is strongly non-Gaussian and temporally structured, and supervisory signals are often sparse and operationally triaged (e.g., manual reviews), creating a missing-not-at-random (MNAR) setting in which the probability of receiving a label depends on behavioral signals and system-level constraints ( 13 – 18 ). In addition, operational feedback loops can arise when existing RG systems influence which customers are reviewed and labelled ( 13 – 15 ). As a result, methods must preserve informative extremes, enforce causal ordering to prevent label leakage, quantify temporal portability, and represent risk as a dynamic process rather than a static score ( 15 , 19 , 20 ). Yet, to our knowledge, no published framework has integrated all of these elements into a single auditable pipeline. In this study, we derive an operational proxy definition of gambling-related risk from routine platform telemetry and triaged analyst assessments, with the explicit goal of enabling auditable early warning and workload planning rather than clinical diagnosis ( 6 ). 
We propose an end-to-end framework that (i) represents heavy-tailed behavior using interpretable exceedance events (whether extreme activity occurred) and exceedance magnitudes (how extreme it was), summarized over short horizons; (ii) learns leakage-safe, label-aware embeddings of 30-day behavioral windows using a teacher–student conditional variational autoencoder (CVAE) (21–23); (iii) corrects for MNAR manual labelling using a backlog-aware label inference step (16–18, 24); and (iv) infers time-varying behavioral regimes using a Gaussian hidden Markov model (GHMM) to produce calibrated three-class probabilities for operational use (25, 26). Our objectives were to (1) derive a discrete, interpretable regime-based operational proxy definition of gambling-related risk; (2) validate alignment to manual analyst assessments under strict leakage controls; (3) quantify generalization across customers and over time using temporal holdout evaluation; and (4) evaluate early-warning performance and operating points under explicit capacity constraints that reflect realistic review budgets (2).

Methods

Study design and data sources

We conducted an observational analysis using de-identified daily behavioral records from customers of a licensed online gambling operator. Four behavioral streams were analyzed separately: transactions (daily financial movements), bets (stakes and wins/losses), sessions (engagement timing and duration), and payments (deposit and withdrawal events). Two supervisory sources were available: manual analyst assessments of gambling risk (five ordinal categories, used as the reference standard for evaluation) and daily responsible-gaming predictions (used as proxy labels for representation learning and feature selection).
These data characteristics (high-frequency behavioral logs, sparse expert labels, and heavy-tailed distributions) motivated a staged analysis pipeline designed to separate data preparation, representation learning, label correction, and dynamic modelling.

Pipeline overview and order of operations

The analysis was implemented as a deterministic, leakage-controlled pipeline with persisted artefacts between stages. Processing followed a fixed sequence designed to enforce causal ordering and reproducibility: (1) preprocessing with hybrid window construction and explicit auditing for label leakage; (2) feature engineering tailored to heavy-tailed behavioral data, with parameters and transformations frozen across splits and window types; (3) feature selection restricted to normalized and derived features to reduce redundancy while preserving short-horizon temporal signal; (4) representation learning using a hierarchical conditional variational autoencoder (CVAE) trained in a teacher–student regime with risk-specific priors; (5) backlog-aware label inference (BALI) to augment training data under missing-not-at-random (MNAR) manual assessment; and (6) dynamic regime modelling using a Gaussian hidden Markov model (GHMM) to derive a three-level operational proxy definition of risk.

Design rationale: The pipeline is deliberately multi-scale and tail-focused. Gambling telemetry tends to be heavy-tailed and bursty, and dependencies can cascade across streams and time scales (e.g., long sessions co-occurring with high stakes and followed by payment events) (3, 4, 5). Our guiding principle was to align each modelling component with this structure: short-horizon exceedance features preserve informative extremes; the hierarchical CVAE compresses short-, mid-, and long-range dynamics into separate latent blocks; and the GHMM then treats these embeddings as noisy observations of a slowly evolving latent regime.
All transformations, thresholds, and tuning decisions were estimated using the training split only (or training folds under cross-fitting) and then applied unchanged to validation, test, and temporal hold-out sets ( 27 ). Early stages of the pipeline focus on constructing leakage-safe analysis units and stable feature representations before any label-dependent modelling is performed, ensuring that downstream performance reflects generalization rather than artefacts of temporal or supervisory leakage. Windowing, splits, and leakage controls Behavioral time series were converted into rolling 30-day analysis windows with a one-day stride using two complementary constructions. Activity-based windows comprised 30 sequential daily records and were split at prolonged inactivity gaps (runs of ≥ G consecutive days with no recorded activity; G = 6), maximizing coverage for unsupervised and proxy-supervised representation learning while preserving within-window temporal structure. Calendar-based windows comprised 30 consecutive calendar days (allowing limited inactivity) and were aligned to manual analyst assessment dates to support supervised evaluation on a fixed temporal scale. We used 30-day windows (and, where relevant, multiples thereof) to match operational reporting and to capture approximately monthly periodicities in gambling behavior (e.g., pay-cycle effects). In practice, this means that for each manual assessment, we defined a fixed look-back window ending before the assessment date. To minimize dependence and prevent information leakage, we applied a hybrid splitting strategy. Customers were assigned exclusively to training, validation, or test sets, ensuring cross-customer independence. Within training customers, the most recent fraction (latest 20%) of each individual timeline was further reserved as a temporal hold-out to evaluate short-term generalization under realistic forward-in-time (“future-risk”) conditions. 
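The hybrid splitting strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 70/15/15 customer proportions, and the use of integer day indices for window end dates are all assumptions made for the example; only the customer-exclusive assignment and the "latest 20%" temporal hold-out come from the text.

```python
import random
from collections import defaultdict

def split_customers(customer_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign each customer to exactly one split (proportions are assumed)."""
    ids = sorted(set(customer_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }

def temporal_holdout(windows, train_customers, holdout_frac=0.2):
    """Reserve the latest fraction of each training customer's windows.

    windows: iterable of (customer_id, window_end_day) pairs.
    Returns (fit_windows, holdout_windows).
    """
    by_customer = defaultdict(list)
    for cid, end_day in windows:
        if cid in train_customers:
            by_customer[cid].append(end_day)
    fit, holdout = [], []
    for cid in sorted(by_customer):
        ends = sorted(by_customer[cid])
        cut = len(ends) - int(holdout_frac * len(ends))
        fit.extend((cid, d) for d in ends[:cut])
        holdout.extend((cid, d) for d in ends[cut:])
    return fit, holdout
```

Splitting by customer first and by time second keeps the two generalization questions separate: the test customers probe cross-customer portability, while the temporal hold-out probes forward-in-time portability within customers already seen in training.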
( 20 ) A strict label-leakage audit was then applied across all splits and window types. Any window for which a manual assessment date fell within the corresponding 30-day interval (start date ≤ label date ≤ end date) was excluded, treating labels as occurring after the predictive window. This audit enforces causal ordering by ensuring that no window contains information contemporaneous with, or subsequent to, its supervisory label ( 20 ). All subsequent feature construction, representation learning, and regime modelling were performed exclusively on these leakage-controlled windows. Study design, data splitting, leakage audit rules, and the complete analysis pipeline are summarized in Fig. 1 . Preprocessing and enhanced heavy-tail feature engineering Preprocessing was designed to correct implausible data artefacts while preserving genuine heavy-tailed behavior characteristic of gambling activity. We applied systematic audit rules to identify and correct implausible records (e.g. sign errors and impossible values; Appendix A), enforced consistent handling of raw and transformed variables, and retained large magnitudes without aggressive truncation to avoid distorting tail behavior. For heavy-tailed metrics, we represented extreme behavior using per-feature exceedance constructions inspired by peaks-over-threshold (POT) ideas ( 28 ). Per-feature thresholds derived from the training data were used to generate (i) exceedance indicators, capturing whether unusually large events occurred, and (ii) exceedance magnitudes, defined as how far observations exceeded the threshold. Excess magnitudes were log-transformed to stabilize scale while preserving relative intensity (Appendix A). This representation focuses explicitly on behavior beyond high thresholds and helps retain information about rare but operationally meaningful events that can be obscured by global transformations or aggressive truncation. 
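As a rough illustration of the exceedance construction, the sketch below derives per-feature thresholds from training data and then builds exceedance indicators and log-transformed excess magnitudes. The 0.95 quantile, the use of `log1p` as the log transform, and the function names are assumptions for this example; the paper defers the actual threshold rule and transform to Appendix A.

```python
import numpy as np

def fit_thresholds(train, q=0.95):
    """Per-feature tail thresholds, estimated on training data only.

    The quantile level q is an assumed choice for illustration.
    """
    return np.quantile(train, q, axis=0)

def exceedance_features(x, thresholds):
    """Return (indicator, log_magnitude) arrays with the same shape as x.

    indicator     : 1 where the observation exceeds its feature threshold
    log_magnitude : log(1 + excess), zero when there is no exceedance
    """
    excess = np.maximum(x - thresholds, 0.0)
    indicator = (excess > 0).astype(np.int8)
    log_magnitude = np.log1p(excess)  # assumed form of the log transform
    return indicator, log_magnitude
```

Because the thresholds are returned by a separate fitting step, the same frozen values can be applied to validation, test, and temporal hold-out windows, in line with the train-only estimation rule stated in the text.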
Extremes were then summarized over multiple short horizons using block statistics and rolling aggregates, such as weekly exceedance counts and maximum excess. This representation preserves separate information about the frequency and severity of extreme events, rather than collapsing them into a single scale-normalized summary. To characterize temporal dynamics beyond raw magnitude, we additionally engineered scale-free momentum features using standardized normalized partial sums (SNPS), which help distinguish sustained patterns of activity from short-lived spikes ( 29 ). We also included lagged tail-dependence interaction features (χ-lift) to capture short-horizon cascades across behavioral dimensions, such as clustered extreme behavior occurring close in time ( 30 ). All preprocessing parameters—including imputation values, tail thresholds, selected interaction pairs, and final feature ordering—were estimated on the training split only and frozen via a consistency manager before application to validation, test, and temporal hold-out windows. This process resulted in a high-dimensional but stable feature representation, motivating a subsequent feature-selection stage to reduce redundancy while retaining the most temporally informative signals prior to representation learning. Feature selection To reduce dimensionality and improve downstream stability, we performed feature selection after enhanced feature engineering and before representation learning. Selection was restricted to normalized and derived features; raw base columns were excluded by design to avoid scale artefacts and leakage from untransformed heavy-tailed magnitudes. The inactivity-gap feature (days since last activity) was always retained, as it encodes operationally relevant disengagement independent of monetary intensity. 
Feature relevance was ranked using a multi-criterion, sequence-aware approach combining unsupervised measures of temporal structure with auxiliary responsible-gaming (RG) signals. Unsupervised criteria quantified within-window temporal variability, cross-feature redundancy (absolute correlations), and structural contribution (PCA loadings). We additionally estimated temporal predictability by fitting a next-window forecasting model (LSTM in our implementation) ( 31 ) and computing permutation-based importance ( 32 ). To bias selection toward operationally meaningful signals without using study outcome labels, we overlaid weak supervision from RG model outputs, including association with continuous RG scores and sensitivity around predicted RG category transitions. Final rankings were obtained by weighted aggregation of these scores while retaining a small set of predefined stability features. The final selected feature set was determined on the training split only and then fixed and applied unchanged to validation, test, and temporal hold-out data. This procedure reduced redundancy while preserving short-horizon temporal dynamics and ensured that downstream representation learning operated on a stable, leakage-controlled feature basis. Implementation details for each criterion are provided in Appendix. Representation learning with a hierarchical conditional VAE and teacher–student training Using the fixed, selected feature set, we learned low-dimensional representations of each 30-day window for downstream temporal modelling. Fixed-length embeddings were obtained with a hierarchical conditional variational autoencoder (CVAE) ( 21 , 22 ) with three parallel causal temporal convolutional network (TCN) encoders (short/mid/long receptive fields) ( 33 ). Each encoder produced a sequence of hidden states which was contextualized with multihead self-attention ( 34 ) and summarized by masked mean pooling to yield a scale-specific representation. 
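Masked mean pooling, the last step in each encoder branch, can be sketched in a few lines. This is a NumPy illustration under assumed shapes (batch, time, dim), not the authors' code; the mask is taken to mark which days in the window carry valid observations.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average hidden states over valid time steps only.

    hidden: array of shape (batch, time, dim)
    mask  : array of shape (batch, time), 1 for valid steps and 0 otherwise
    Returns an array of shape (batch, dim).
    """
    weights = mask.astype(hidden.dtype)[..., None]
    total = (hidden * weights).sum(axis=1)
    # Clip the count so a fully masked window yields zeros, not NaNs.
    count = np.clip(weights.sum(axis=1), 1.0, None)
    return total / count
```

Pooling over valid steps only means that padded or inactive days do not dilute the scale-specific summary.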
Latent parameters were produced hierarchically, with the mid-level latent conditioned on the short latent and the long-level latent conditioned on both short and mid; the three latent vectors were concatenated to form the final embedding. To stabilize risk semantics in latent space, the CVAE used class-specific diagonal-Gaussian priors at each temporal scale, with one learned prior per risk category (plus an explicit ‘unknown’ prior), rather than a single standard normal prior. This encourages embeddings associated with similar risk levels to occupy coherent regions of the latent space while preserving overlap for borderline cases. Training followed a two-phase teacher–student design to decouple representation learning from supervision on sparse manual labels. In phase 1, a teacher model was trained on activity-based windows using RG-derived proxy signals, combining masked reconstruction with auxiliary prediction heads for RG score (regression) and RG category distribution (classification). In phase 2, a student model was adapted to calendar-based windows aligned to manual analyst assessments: the TCN encoders and decoder were frozen, and only the manual classification head together with the final latent projection layers were updated, yielding a deliberate “fine-tuning” step that preserves the coarse latent geometry learned by the teacher while adapting decision boundaries to gold labels. Student training used cross-entropy with class-balanced sampling, with an additional small ordinal shaping term and (optionally) a lightweight distillation loss aligning student predictions to the teacher RG head (23). To avoid posterior collapse, an issue common in VAE training (35, 36), and to promote stable latent utilization, training used KL annealing (warm-up) together with free-bits thresholding (37), spectral normalization (38), and a diversity regularizer on the aggregated latent covariance.
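To make the anti-collapse terms concrete, the following sketch computes the KL divergence between a diagonal-Gaussian posterior and a diagonal-Gaussian prior, applies a per-dimension free-bits floor, and scales the result by a linear warm-up weight. The function names, the linear schedule, the warm-up length, and the free-bits value are assumptions for illustration; the paper does not report these settings in this section.

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """Per-dimension KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * (
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def kl_weight(step, warmup_steps):
    """Linear KL annealing from 0 to 1 (assumed schedule)."""
    return min(1.0, step / warmup_steps)

def regularized_kl(mu_q, logvar_q, mu_p, logvar_p, step,
                   warmup_steps=1000, free_bits=0.1):
    """Annealed KL term with a free-bits floor on each latent dimension.

    Inputs have shape (batch, dim). With class-specific priors, mu_p and
    logvar_p would be looked up per sample from a table with one row per risk
    category plus an 'unknown' row (hypothetical layout).
    """
    kl_per_dim = kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p).mean(axis=0)
    kl_per_dim = np.maximum(kl_per_dim, free_bits)  # no reward below the floor
    return kl_weight(step, warmup_steps) * kl_per_dim.sum()
```

The floor removes the incentive to push any dimension's KL all the way to zero, which is the usual route to posterior collapse, while the warm-up lets the reconstruction term shape the latent space before the prior is fully enforced.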
Latent utilization was monitored with a participation-ratio diagnostic to detect collapse. During calendar-window adaptation and embedding extraction, the encoder was always conditioned on an explicit ‘unknown-label’ token to prevent oracle label conditioning. The resulting embeddings were then treated as fixed representations and used as inputs for all subsequent label inference and temporal regime modelling. Implementation details are provided in Appendix.

Backlog-aware label inference (BALI) for MNAR manual labels

We introduce backlog-aware label inference (BALI), a conservative pseudo-labelling step applied after embedding extraction to mitigate selection bias arising from operational triage of manual analyst assessments. Manual labels were treated as missing not at random (MNAR): the probability that a window receives a manual rating depends jointly on behavioral signals, summarized by the learned embeddings, and on system-level constraints, including the prevailing annotation backlog (16–18). We estimated the labelling propensity P(labelled | X, B) using a hazard model (39) (Appendix B), where X denotes the embedding and B denotes backlog covariates (e.g., recent unlabeled volume and within-sequence position). Inspired by backlog effects studied in actuarial delay and claims-processing settings (24), BALI then applies a Bayes adjustment to recover an estimate of P(Y | X) from the biased labelled subset, and uses this adjusted distribution to pseudo-label a prespecified subset of high-confidence unlabeled windows for training augmentation. The Bayes adjustment is

$$
P(Y = y \mid \text{unlabeled}, X) \;\propto\; P(\text{unlabeled} \mid Y = y)\, P(Y = y \mid X),
$$

where class-conditional censoring rates P(unlabeled | Y = y) were estimated directly from the labelling-propensity model (40). Pseudo-labels were accepted only when the maximum adjusted class probability exceeded a prespecified threshold, yielding conservative label augmentation intended to reduce false escalation.
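The Bayes adjustment and the acceptance rule can be sketched as below. The function names are hypothetical, the class probabilities and censoring rates are taken as given inputs (the hazard model that produces them is not reproduced here), and the default threshold of 0.7 mirrors the value reported later in the Results.

```python
import numpy as np

def bali_adjust(p_y_given_x, p_unlabeled_given_y):
    """Adjusted class posterior for unlabeled windows.

    p_y_given_x         : (n_windows, n_classes) estimates of P(Y = y | X)
    p_unlabeled_given_y : (n_classes,) class-conditional censoring rates
    Returns P(Y = y | unlabeled, X), with each row normalized to sum to 1.
    """
    unnormalized = p_y_given_x * p_unlabeled_given_y
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

def accept_pseudo_labels(posterior, threshold=0.7):
    """Keep a pseudo-label only if the top adjusted probability exceeds the threshold."""
    labels = posterior.argmax(axis=1)
    keep = posterior.max(axis=1) > threshold
    return labels, keep
```

For example, a window with P(Y | X) = (0.6, 0.3, 0.1) and censoring rates (0.9, 0.5, 0.1) has an adjusted posterior of roughly (0.77, 0.21, 0.01) and would be accepted as class 0, whereas a window with P(Y | X) = (0.4, 0.3, 0.3) reaches only about 0.67 for its top class and would be left unlabeled.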
Pseudo-label quality was assessed using K-fold cross-fitting ( 27 ) with group-aware splits to prevent information leakage across customers. All pseudo-labelled windows were confined to the training split for downstream modelling; all reported validation, test, and temporal holdout results relied exclusively on original manual assessments. Conceptually, BALI treats the observation process (whether a window is labelled) as part of the data-generating mechanism, rather than an ignorable missingness pattern. This is analogous to queueing/backlog settings in which processed cases are a biased subset of all cases because processing capacity is finite ( 24 ). In our setting, we do not attempt to infer latent “true” event times or full delay distributions; the goal is a conservative correction that (i) improves training-label coverage under selective labelling, (ii) keeps embeddings and temporal structure unchanged, and (iii) preserves strict separation between training augmentation and manual-only evaluation. Dynamic Bayesian regime modelling with an improved GHMM To characterize dynamic behavioral regimes and transitions over time, we fitted a Gaussian hidden Markov model (GHMM) to the learned window embeddings ( 25 ). Prior to model fitting, embeddings were standardized and, where appropriate, dimensionally reduced to improve numerical stability, using rank-based marginal normalization ( 41 ) followed by whitening and optional principal component analysis. These transformations reduce the influence of heavy-tailed latent dimensions, improve covariance conditioning, and make Gaussian emission assumptions more tenable in finite samples. All transformations were estimated on the training split only and applied unchanged to validation, test, and temporal holdout data. 
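One way to realize rank-based marginal normalization followed by whitening is shown below. This is a sketch under stated assumptions rather than the authors' procedure: the class name, the mid-rank convention, the clipping of out-of-range values, and the eigenvalue floor are all illustrative choices, and the optional PCA step is omitted.

```python
import numpy as np
from statistics import NormalDist

class RankGaussianWhitener:
    """Rank-based inverse-normal transform per dimension, then PCA whitening."""

    def fit(self, train):
        # Training order statistics define the empirical CDF for each dimension.
        self.sorted_ = np.sort(train, axis=0)
        z = self._to_normal(train)
        self.mean_ = z.mean(axis=0)
        cov = np.cov(z - self.mean_, rowvar=False)
        evals, evecs = np.linalg.eigh(np.atleast_2d(cov))
        # Scale each eigenvector by 1/sqrt(eigenvalue); floor guards conditioning.
        self.whiten_ = evecs / np.sqrt(np.maximum(evals, 1e-8))
        return self

    def _to_normal(self, x):
        n = self.sorted_.shape[0]
        inv_cdf = np.vectorize(NormalDist().inv_cdf)
        out = np.empty(x.shape, dtype=float)
        for j in range(x.shape[1]):
            rank = np.searchsorted(self.sorted_[:, j], x[:, j], side="right")
            # Mid-rank probabilities, clipped so unseen extremes stay finite.
            u = np.clip((rank - 0.5) / n, 0.5 / n, 1.0 - 0.5 / n)
            out[:, j] = inv_cdf(u)
        return out

    def transform(self, x):
        return (self._to_normal(x) - self.mean_) @ self.whiten_
```

Fitting on the training split and calling `transform` on other splits keeps the mapping frozen, and the clipping step bounds the influence of embedding values more extreme than anything seen in training.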
GHMM training employed regularized transition structure to favor regime persistence and discourage implausible long-range jumps ( 42 ), reflecting the expectation that behavioral risk evolves gradually rather than instantaneously. Emission distributions were modelled as multivariate Gaussians with constrained covariance structure to ensure stable estimation in high dimensions. The number of states and regularization settings were selected on the training split using validation performance and stability diagnostics. To render regimes interpretable and ordinal, GHMM states were post-ordered using an auxiliary risk scorer derived from labeled windows, producing a monotone severity ordering without altering the fitted state dynamics. For downstream use, posterior summaries of the ordered states were mapped into three action-oriented risk classes using a calibrated multinomial mapper. This mapping yields a discrete operational proxy definition while preserving the probabilistic regime structure underpinning temporal transitions. Operational proxy definition and outcome construction While the Gaussian hidden Markov model captures latent behavioral regimes and their temporal transitions, operational use requires mapping these regimes into a small number of actionable risk categories. Manual analyst assessments were originally recorded on a five-level ordinal scale. For the operational proxy definition, these ratings were collapsed into three action-oriented classes (low, medium, and high) to reflect practical intervention thresholds for monitoring triage and workload planning, not clinical diagnosis. Several plausible five-to-three collapse schemes were prespecified. 
Because assessment distributions and operational review criteria differ by stream, the collapse was treated as a stream-specific calibration step: among the prespecified schemes, we selected the mapping that maximized balanced accuracy on the validation split and then fixed it and applied it unchanged to test and temporal hold-out data. This mapping defines the discrete proxy labels used for reporting and for escalation-event construction, without changing the learned embeddings or the fitted GHMM dynamics; all reported validation, test, and temporal holdout results rely exclusively on original manual analyst assessments. The final output of the operational definition is a three-class probability distribution for each 30-day window, obtained by combining posterior summaries of the ordered GHMM regimes with a calibrated multinomial mapping. This probabilistic formulation supports downstream thresholding and capacity-constrained decision policies while retaining uncertainty information. Statistical analysis Statistical analyses target two complementary questions: (i) how the probabilistic high-risk score supports capacity-constrained monitoring policies, and (ii) how the resulting three-class mapping agrees with selectively observed manual analyst assessments under strict leakage controls (criterion validity). Primary evaluation therefore focuses on policy-level early-warning utility under fixed review budgets (probability thresholds and top-K queue policies), quantified by detection of subsequent escalation events, lead time, alert-level PPV, escalation-level recall, and number needed to review. As supporting criterion-validity evidence on held-out labelled windows, we report balanced accuracy (macro-averaged recall), macro-averaged F1-score, class-specific precision/recall, and one-vs-rest area under the precision–recall curve (AUPRC) for the high-risk class. 
Argmax-derived class assignments are reported as diagnostics; operational policies act on calibrated P(high) rather than argmax class labels. Early warning and lead time: We defined escalation events as the first occurrence of a high-risk manual analyst assessment (under the stream-specific five-to-three collapse) and estimated lead time as the difference between the escalation date and the first model alert date occurring beforehand. Alerts were generated under two families of decision policies: (i) fixed probability thresholds on P(high | window), and (ii) capacity-constrained top-K queue policies that, within each review period (day or week), rank customers by P(high | window) (maximum score within the period) and issue alerts for the K highest-scoring cases, with at most one alert per customer-period. Lead-time distributions were summarized for horizons of 1, 7, 14, and 30 days. Operational actionability and decision-analytic evaluation: We evaluated operating points by reporting alert rate, precision, recall for future escalation within prespecified horizons, and number needed to review under capacity constraints. Calibration: We assessed calibration of the model’s probabilistic outputs to support downstream thresholding and capacity planning. Calibration was evaluated using reliability diagrams and expected calibration error (ECE), with binning performed on P(high) to match high-risk decision support. Temperature scaling ( 26 ) was used as a post-hoc recalibration method fit on the validation split only and then applied unchanged to held-out evaluation data. Uncertainty: We computed 95% confidence intervals using customer-level bootstrap resampling to respect within-customer dependence ( 43 ). 
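The capacity-constrained top-K queue and the lead-time calculation described above can be sketched as follows. Several details are assumptions made for the example: days are integer indices, weeks are blocks of seven consecutive days, alerts are dated to the last day of the review week so that ranking uses no information from after the alert, and ties are broken by customer identifier.

```python
from collections import defaultdict

def top_k_weekly_alerts(scores, k=10):
    """Issue at most k alerts per week, at most one per customer.

    scores: iterable of (customer_id, day, p_high) with integer day indices.
    Returns a list of (customer_id, alert_day, score), ordered by week.
    """
    best = {}  # (week, customer) -> maximum P(high) within that week
    for cid, day, p in scores:
        key = (day // 7, cid)
        if key not in best or p > best[key]:
            best[key] = p
    per_week = defaultdict(list)
    for (week, cid), p in best.items():
        per_week[week].append((p, cid))
    alerts = []
    for week in sorted(per_week):
        ranked = sorted(per_week[week], key=lambda item: (-item[0], item[1]))
        alert_day = week * 7 + 6  # assumed: alert issued at the end of the week
        alerts.extend((cid, alert_day, p) for p, cid in ranked[:k])
    return alerts

def lead_times(alerts, escalation_day):
    """Days from each customer's first pre-escalation alert to escalation."""
    first_alert = {}
    for cid, day, _ in alerts:
        esc = escalation_day.get(cid)
        if esc is not None and day < esc:
            first_alert[cid] = min(day, first_alert.get(cid, day))
    return {cid: escalation_day[cid] - day for cid, day in first_alert.items()}
```

Under this sketch, escalation-level recall is the number of customers with an entry in the `lead_times` result divided by the number of customers with an escalation event, and number needed to review is the number of alerts divided by the number of escalations detected.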
Baselines and ablations: We compared the GHMM-based definition with (i) an ordinal baseline risk scorer trained on labelled windows alone, (ii) conventional non-temporal classifiers trained either on engineered feature sets or on learned window embeddings, and (iii) targeted ablations removing heavy-tail feature groups, BALI augmentation, or the GHMM temporal component. To contextualize incremental validity relative to incumbent systems, we additionally report an “RG proxy only” comparator based solely on the operator’s daily RG proxy predictions. Results We report results from two complementary perspectives: (i) operational utility under capacity-constrained review policies, and (ii) agreement with selectively observed manual analyst assessments on labelled windows (criterion validity / discrimination) under strict leakage controls. Operationally, under a top-10-per-week review queue the proposed definition detected 38.6–62.3% of escalation events across streams with median lead times of 42–290 days. We then report supporting evidence on label coverage and MNAR correction (BALI), face-validity diagnostics for heavy-tail features, agreement metrics, regime dynamics, calibration, baseline comparators, and frozen-pipeline temporal portability. Label coverage and backlog-aware label inference Manual analyst labels were operationally triaged and therefore sparse and potentially biased toward higher-risk windows. Training label coverage varied markedly by stream (transactions 3.0%, sessions 57.5%, payments 85.8%, bets 81.2%). Using hazard-based backlog-aware label inference (BALI) with a prespecified confidence threshold (t = 0.7), we pseudo-labelled an additional 8.0k–179.6k windows per stream, increasing effective training coverage to 15.3% (transactions), 96.0% (sessions), 96.8% (payments), and 93.3% (bets) (Table 1 ; Figure S5A–B; Appendix B). Table 1 Training label coverage and BALI pseudo-label yield. 
Label coverage is the proportion of windows manually labelled by analysts in the training split. BALI pseudo-label yield is the number of additional windows assigned pseudo-labels above the confidence threshold (t = 0.7), and the resulting effective labelled coverage. Cross-fit BA reports cross-fitted balanced accuracy of the BALI pseudo-labeller against held-out manual labels; BA method indicates the pseudo-labeller used.

| Stream | N Train | N Labeled (orig) | Coverage (orig) | N Pseudo (t = 0.7) | Coverage (t = 0.7) | Cross-fit BA | BA Method |
|---|---|---|---|---|---|---|---|
| transactions | 438895 | 13004 | 0.030 | 54171 | 0.153 | 0.472 | bali_probs_direct |
| sessions | 465817 | 267764 | 0.575 | 179606 | 0.960 | 0.368 | bali_probs_direct |
| payments | 181104 | 155447 | 0.858 | 19898 | 0.968 | 0.223 | bali_probs_direct |
| bets | 66100 | 53703 | 0.812 | 8001 | 0.933 | 0.461 | bali_probs_direct |

Exceedance diagnostics and face validity of heavy-tail features

Across streams, derived exceedance features generally showed monotone associations with analyst-assigned risk categories on the validation split, supporting face validity. Exceedance counts and excess magnitudes over short horizons were positively correlated with risk severity, with stream-specific patterns reflecting different operational signatures (Fig. 2B–E). The strongest gradients were stream-specific: in payments, exceedances related to disrupted payment behavior (e.g., cancelled or reversed payment events and their excess magnitudes) increased sharply with risk; in sessions, extreme session-duration patterns were most informative; in bets, spikes in turnover and high-stakes activity dominated (Fig. 2B–E). In temporal holdout, many (but not all) of these univariate correlations replicated in sign and magnitude, consistent with some drift in single-feature associations over time (Figure S3; Table S10). Figure S4 provides ladder plots stratified by manual risk category (Appendix A7).
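To make the notion of exceedance counts (frequency) and excess magnitudes (intensity) over a short horizon concrete, the following is a minimal sketch. It assumes a single fixed threshold and a trailing window ending on the current day, so that no future values are used; the function name, threshold, horizon, and toy series are hypothetical and are not taken from the study's feature pipeline.

```python
def exceedance_features(values, threshold, horizon):
    """For each day t, summarise exceedances of `threshold` over the trailing
    `horizon` days (inclusive of t): how many occurred and their summed excess."""
    counts, excess_sums = [], []
    for t in range(len(values)):
        window = values[max(0, t - horizon + 1): t + 1]  # past and present only
        excesses = [v - threshold for v in window if v > threshold]
        counts.append(len(excesses))       # frequency of extreme days
        excess_sums.append(sum(excesses))  # intensity above the threshold
    return counts, excess_sums

# Toy daily turnover series (hypothetical units), threshold 100, 3-day horizon.
daily_turnover = [10, 12, 300, 11, 250, 400, 9]
counts, excess_sums = exceedance_features(daily_turnover, threshold=100, horizon=3)
print(counts)       # isolated spike on day 2, clustered extremes from day 4
print(excess_sums)
```

The count separates an isolated spike (count of 1) from clustered extremes (count of 2 or more within the horizon), while the summed excess retains the size of the extremes instead of clipping them.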
Agreement with manual analyst assessments (criterion validity)

Agreement with manual analyst assessments varied by stream, with strongest criterion validity in payments and bets. On labelled test windows (Table 2), balanced accuracy (macro-averaged recall across three risk classes) was 0.624 (payments), 0.601 (bets), 0.515 (sessions), and 0.380 (transactions) (Fig. 3A). For reference, chance-level balanced accuracy in a three-class setting is 1/3 ≈ 0.333, so all streams demonstrated better-than-random agreement, although agreement in transactions was only modestly above chance. Balanced accuracy can be interpreted as the average per-class recall (e.g., 0.624 indicates 62.4% mean sensitivity across classes) and is reported here as criterion-validity evidence under class imbalance. Because manual assessments are selectively observed, these metrics quantify agreement with observed labels rather than clinical ground truth. Figure 3A reports results with and without BALI; we include both to illustrate sensitivity of agreement to selective labelling under the same leakage-audited protocol (Appendix B).

Table 2 Criterion validity (agreement) on labelled test windows. Metrics are accuracy, balanced accuracy, macro-averaged F1, and high-risk recall for the three-class mapping after GHMM smoothing. Chance-level balanced accuracy is 0.33.

| Stream | N | Accuracy | Balanced Accuracy | Macro F1 | High-Risk Recall |
|---|---|---|---|---|---|
| transactions | 2780 | 0.356 | 0.380 | 0.318 | 0.160 |
| sessions | 4340 | 0.261 | 0.515 | 0.289 | 0.757 |
| payments | 1922 | 0.256 | 0.624 | 0.198 | 0.025 |
| bets | 753 | 0.600 | 0.601 | 0.543 | 0.641 |

Class-specific recall patterns

Class-wise recall patterns differed by stream (Table 3). Under argmax assignment, payments showed high recall for the low and medium categories (0.846 and 1.000) but very low recall for the high-risk category (0.025), indicating that the argmax three-class summary is conservative for high risk in this stream.
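As a worked check of how balanced accuracy and plain accuracy can diverge under class imbalance, the short sketch below recomputes both for the payments stream from the class-wise recalls quoted above and the class supports reported in Table 3. Variable names are illustrative only.

```python
# Payments stream: per-class recall and number of labelled test windows (Table 3).
recall = {"low": 0.846, "medium": 1.000, "high": 0.025}
support = {"low": 514, "medium": 22, "high": 1386}

# Balanced accuracy is the unweighted mean of per-class recall.
balanced_accuracy = sum(recall.values()) / len(recall)

# Plain accuracy weights each class's recall by its share of windows.
n_total = sum(support.values())
accuracy = sum(recall[c] * support[c] for c in recall) / n_total

print(round(balanced_accuracy, 3), round(accuracy, 3))
```

Because the high-risk class holds most of the payments windows and has the lowest recall, the support-weighted accuracy falls far below the unweighted mean recall, which matches the accuracy and balanced accuracy reported for payments in Table 2.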
Bets showed moderate high-risk recall (0.641) and high recall for the medium category (0.935), with lower recall for low risk (0.229). Sessions showed relatively high recall for high risk (0.757) but low recall for low risk (0.175). Transactions remained challenging, with high-risk recall 0.160 and medium-category recall 0.251. These argmax metrics are reported as diagnostics; operational queue policies act on P(high) rather than argmax class assignments, and can still yield useful early warning even when argmax high-risk recall is low (e.g., payments detection 47.9% under a top-10-per-week queue).

Table 3 Class-wise recall and support on labelled test windows. Values are recall within each true class under argmax assignment for the three-class mapping.

| Stream | Low Recall | Low Support | Medium Recall | Medium Support | High Recall | High Support |
|---|---|---|---|---|---|---|
| transactions | 0.728 | 698 | 0.251 | 1639 | 0.160 | 443 |
| sessions | 0.175 | 3571 | 0.614 | 510 | 0.757 | 259 |
| payments | 0.846 | 514 | 1.000 | 22 | 0.025 | 1386 |
| bets | 0.229 | 140 | 0.935 | 92 | 0.641 | 521 |

Regime persistence and dynamics

GHMM state sequences exhibited pronounced persistence with high self-transition probabilities and long implied dwell times, consistent with behavioural regimes that change slowly over time (Figure S6; Table 4). Mean self-transition probability ranged from 0.363 (payments) to 0.442 (bets), with maximum self-transition probabilities near 1.0 in all streams. Mean regime duration ranged from 19.9 steps (payments) to 165.6 steps (transactions); one step corresponds to one day (a one-day shift of the rolling window) (Table 4). Because GHMM transition estimation was regularized to favor persistence (42), these values should be interpreted as model-implied summaries under an explicit persistence prior; operationally, persistence gates provide an explicit control for trading alert stability against detection (Appendix F; Tables S5–S6).

Table 4 GHMM regime persistence.
Summary statistics of the fitted GHMM transition matrices by stream, including mean self-transition probability, its range, and mean regime duration (in steps; one step corresponds to one day / one rolling-window shift).

| Stream | N States | Mean Self-Trans | Min Self-Trans | Max Self-Trans | Mean Duration | Stationary Entropy |
|---|---|---|---|---|---|---|
| transactions | 10 | 0.383 | 0.067 | 0.999 | 165.6 | 1.255 |
| sessions | 12 | 0.403 | 0.080 | 0.999 | 126.6 | 2.039 |
| payments | 10 | 0.363 | 0.096 | 0.994 | 19.9 | 1.159 |
| bets | 10 | 0.442 | 0.111 | 0.996 | 31.7 | 1.228 |

Operational evaluation: early warning and capacity-constrained alerting

Operationally, the proposed definition provided actionable early warning under capacity constraints. Using top-K queueing policies (e.g., selecting the 10 highest-risk customers per week based on P(high)), we achieved meaningful precision and recall trade-offs across horizons (Table 5; Fig. 4A–B; Table S6). Under this top-10-per-week queue, escalation detection ranged from 38.6% (sessions) to 62.3% (bets), with median lead times of 42–290 days (Table 5). Persistence gates can be used as an operational control to trade alert stability and workload against detection and lead time (Appendix F; Tables S5–S6). Early warning was obtained in all four streams, with the strongest 30-day queue yield (alert-level PPV and escalation-level recall; Table 6) observed for payments and bets.

Table 5 Lead time under a top-10-per-week review queue. For each stream, detection rate is the fraction of escalation events preceded by at least one alert; lead time is the number of days between the first alert and escalation among detected events. P(lead ≥ 7) and P(lead ≥ 30) report the proportion of detected escalations with at least 7 or 30 days of warning. A fuller breakdown across lead-time thresholds is reported in Table S5.
| Stream | N Escalations | Detection Rate (95% CI) | Median Lead, days (95% CI) | % ≥7 days (95% CI) | % ≥30 days |
|---|---|---|---|---|---|
| transactions | 85 | 55.3% [44.1%, 66.1%] | 290 [142, 690] | 54.1% [43.0%, 65.0%] | 45.9% |
| sessions | 57 | 38.6% [26.0%, 52.4%] | 42 [20, 57] | 33.3% [21.4%, 47.1%] | 24.6% |
| payments | 165 | 47.9% [40.1%, 55.8%] | 84 [50, 124] | 43.0% [35.4%, 51.0%] | 32.1% |
| bets | 53 | 62.3% [47.9%, 75.2%] | 212 [68, 428] | 56.6% [42.3%, 70.2%] | 49.1% |

Table 6 Queue yield under a top-10-per-week policy (30-day horizon). We report the number of alerts generated, alert-level positive predictive value (PPV), and escalation-level recall.

| Stream | N Escalations | N Alerts | PPV (95% CI) | N Caught | Recall (95% CI) | Alerts per Caught |
|---|---|---|---|---|---|---|
| transactions | 85 | 4007 | 1.0% [0.7%, 1.4%] | 20 | 23.5% [15.0%, 34.0%] | 200.3 |
| sessions | 57 | 2310 | 0.8% [0.5%, 1.2%] | 12 | 21.1% [11.4%, 33.9%] | 192.5 |
| payments | 165 | 2310 | 4.9% [4.1%, 5.9%] | 58 | 35.2% [27.9%, 43.0%] | 39.8 |
| bets | 53 | 2065 | 3.2% [2.5%, 4.0%] | 23 | 43.4% [29.8%, 57.7%] | 89.8 |

Calibration and uncertainty

Calibration varied by stream. Reliability diagrams for two representative streams are shown in Fig. 5A, with per-stream reliability diagrams provided in Figure S2 and bin-level summaries in Table S7. Temperature scaling reduced expected calibration error, but residual miscalibration remained in several streams (Table 7; Appendix F), motivating periodic recalibration and drift monitoring for any operational deployment. Customer-level bootstrap confidence intervals for balanced accuracy on the combined labelled evaluation set (test+holdout) are shown in Table 7 and Table S8 (Fig. 5B).

Table 7 Calibration and uncertainty. Customer-level bootstrap mean and 95% CI for balanced accuracy on the combined labelled evaluation set (test+holdout), and expected calibration error (ECE) on the same set.
| Stream | N Combined | ECE (raw) | ECE (calibrated) | Temperature | BA Mean | BA 95% CI |
|---|---|---|---|---|---|---|
| transactions | 5891 | 0.097 | 0.029 | 0.600 | 0.395 | [0.368, 0.423] |
| sessions | 8449 | 0.287 | 0.191 | 0.400 | 0.497 | [0.459, 0.539] |
| payments | 3311 | 0.441 | 0.277 | 5.000 | 0.560 | [0.464, 0.625] |
| bets | 1066 | 0.294 | 0.261 | 5.000 | 0.599 | [0.510, 0.650] |

Baseline ladder and ablation comparisons

Criterion validity relative to comparator baselines was heterogeneous across streams (Table 8; Fig. 3B; Table S9). Compared with conventional non-temporal baseline models, the full pipeline achieved the strongest balanced accuracy in bets and remained competitive in payments (Table 8), while in sessions and transactions some non-temporal baselines achieved similar or higher macro-F1. When compared with an incumbent RG proxy signal alone, agreement was stream-dependent: an “RG proxy only” comparator achieved comparable or higher balanced accuracy in transactions and payments, whereas the regime-based definition improved agreement in sessions and remained competitive in bets (Table S9). This heterogeneity supports treating the proposed definition as a monitoring construct whose utility is ultimately judged by capacity-constrained operating points rather than by single-metric superiority.

Table 8 Conventional non-temporal baselines on labelled test windows. Logistic regression, random forest, and LightGBM (51) were trained as non-temporal classifiers (without GHMM temporal modelling) for comparison. Additional baselines, ablations, and the RG-proxy-only comparator are reported in the Supplementary Appendix (Table S9).
| Stream | GHMM Pipeline BA | GHMM Pipeline F1 | LogReg BA | LogReg F1 | RF BA | RF F1 | LightGBM BA | LightGBM F1 |
|---|---|---|---|---|---|---|---|---|
| transactions | 0.380 | 0.318 | 0.399 | 0.325 | 0.373 | 0.297 | 0.364 | 0.350 |
| sessions | 0.515 | 0.289 | 0.493 | 0.265 | 0.441 | 0.428 | 0.399 | 0.393 |
| payments | 0.624 | 0.198 | 0.527 | 0.258 | 0.429 | 0.303 | 0.351 | 0.340 |
| bets | 0.601 | 0.543 | 0.543 | 0.443 | 0.496 | 0.497 | 0.404 | 0.417 |

Temporal external validation

In a frozen-pipeline temporal evaluation, criterion-validity agreement and calibration were broadly stable, with holdout balanced accuracy differing from the combined labelled evaluation set by −0.02 to +0.02 across streams (Table 9; Figure S7). For clarity, Table 9 reports metrics on the combined labelled evaluation set (test+holdout, used for bootstrap uncertainty in Table 7) alongside temporal holdout metrics; Table 2 reports test-only performance.

Table 9 Frozen-pipeline temporal portability evaluation. Train BA is balanced accuracy on the training split. Combined (test+holdout) and holdout columns report criterion validity (agreement) (BA, F1) and calibration (ECE) under a pipeline trained on the training period and evaluated without refitting on the temporal holdout period. ΔBA is holdout BA minus combined BA.

| Stream | Train BA | Comb BA | Comb F1 | Comb ECE | Hold BA | Hold F1 | Hold ECE | ΔBA | Hold def. |
|---|---|---|---|---|---|---|---|---|---|
| transactions | 0.446 | 0.394 | 0.331 | 0.097 | 0.405 | 0.341 | 0.089 | 0.011 | holdout split |
| sessions | 0.441 | 0.498 | 0.299 | 0.287 | 0.489 | 0.314 | 0.259 | -0.009 | holdout split |
| payments | 0.538 | 0.567 | 0.219 | 0.441 | 0.550 | 0.231 | 0.473 | -0.017 | holdout split |
| bets | 0.650 | 0.610 | 0.516 | 0.294 | 0.634 | 0.437 | 0.404 | 0.024 | holdout split |

Discussion

This study shows that a practically usable operational proxy definition of gambling-related risk can be derived from routine behavioral telemetry using a leakage-controlled pipeline that preserves heavy-tailed dynamics and models risk as a time-varying regime.
We provide complementary validity evidence: criterion validity (agreement with selectively observed manual assessments), temporal portability under a frozen pipeline, and operational utility under explicit capacity constraints. Criterion-validity agreement on held-out labelled windows was moderate (balanced accuracy 0.38–0.62) and remained similar in a frozen-pipeline temporal holdout (0.40–0.63), suggesting that results were not driven by period-specific fitting. Importantly, inferred state sequences exhibited pronounced persistence with high self-transition probabilities and long implied dwell times, supporting the interpretation of risk as a slowly evolving process rather than an i.i.d. classification problem.

The operational proxy definition derived here is not intended as a clinical diagnosis. Instead, it reflects behavioral patterns consistent with analyst-rated concern and is designed to support triage and workload planning.

A central finding is that headline discrimination metrics can be misleading under severe class imbalance and operational constraints (44). Summarizing probabilistic outputs by “argmax” assignment forces each window into a single class even when uncertainty is high, and it does not reflect how monitoring systems are used. In practice, manual review and intervention capacity is finite, so the operative question is how many cases can be reviewed per unit time and what yield and lead time can be achieved under that budget (2). When we evaluated explicit capacity-constrained policies, such as selecting the K highest-risk customers per week based on P(high), a top-10-per-week queue detected 38.6–62.3% of escalation events across streams, with median lead times of 42–290 days (Table 5).
At a 30-day horizon, alert-level positive predictive value was low (0.8%–4.9%) and escalation-level recall ranged from 21% to 43%, reflecting both low base rates and the constraints imposed by realistic review budgets (Table 6 ). These results emphasize that operational usefulness is determined by workload, yield, and lead time, not by accuracy alone. Placed in the context of account-based gambling research, our work aligns with evidence that behavioral telemetry can support identification of higher-risk play ( 8 , 12 , 45 , 46 ), while addressing limitations repeatedly noted in synthesis work: heterogeneity in outcomes and reporting practices, potential leakage and feedback loops, and limited evaluation of temporal stability and deployment-relevant operating points ( 13 , 15 ). Rather than treating risk as a static classification target, we treat it as a latent state process, enforce strict causal ordering at the window level, and translate probabilities into explicit decision policies (thresholds, top-K queues, persistence gates) that can be audited and compared. Comparator analyses highlight that incumbent responsible-gaming (RG) proxy signals can already explain a substantial fraction of the observed assessment process in some streams (Table S9). In transactions and payments, an RG-proxy-only comparator achieved criterion-validity agreement comparable to or higher than the full pipeline, consistent with the possibility that operational labelling and existing RG systems are coupled through triage and feedback-loop effects ( 13 – 15 ). We therefore interpret the proposed approach as a definition framework that makes modelling choices and operating points explicit (leakage auditing, tail-preserving representations, regime dynamics, queue policies), rather than as a claim of universal single-metric dominance over incumbent proxies. The heavy-tail representation layer is a key methodological and interpretive contribution. 
Gambling activity is dominated by rare extremes rather than average behavior, and naïve clipping or long-horizon aggregation can erase precisely the signals that matter for prevention-oriented monitoring ( 3 ). By explicitly representing extreme behavior via exceedance indicators (frequency) and exceedance magnitudes (intensity), summarized over short horizons, the framework distinguishes persistent clustering of extreme events from isolated spikes. In descriptive diagnostics, exceedance-derived features generally exhibited monotone gradients with analyst-assessed severity across streams, supporting face validity and anchoring regime outputs in observable behavioral signatures. This representation provides a behavioral narrative consistent with a dynamic risk process: escalation is often reflected in increasing persistence and clustering of extremes rather than a smooth shift in mean behavior. Some univariate associations drifted in temporal holdout (Table S10), reinforcing the need for frozen-pipeline temporal evaluation and for representations that do not rely on any single handcrafted feature remaining stable over time. Across stages, the design goal was structural alignment: treat gambling telemetry as a multi-scale, heavy-tailed process and build an analysis pipeline that respects that geometry ( 3 , 4 , 5 ). Exceedance features capture tail events at short horizons; the hierarchical CVAE compresses correlated dynamics across time scales without washing out rare extremes; teacher–student training uses abundant proxy supervision to shape the latent space and sparse gold labels to refine decision boundaries with minimal drift; BALI then corrects supervision under selective labelling; and the GHMM imposes an explicit temporal state model whose persistence can be audited and translated into queue-based review policies ( 16 – 18 , 25 ). 
Taken together, these choices define a coherent and logical modelling story: risk as a slowly varying latent regime manifested through clustered extremes rather than a grab bag of methods. The representation learning stage was designed to summarize 30-day windows into stable low-dimensional embeddings while preventing leakage from sparse manual labels. A hierarchical conditional variational autoencoder with multiple temporal scales ( 21 , 22 ) supports feature compression in a setting where engineered inputs are high-dimensional and strongly correlated, and the teacher–student regime decouples large-scale proxy-supervised representation learning from manual-label adaptation ( 23 ). While such embeddings are not inherently interpretable, we used embedding-geometry diagnostics to assess whether the latent space preserved an ordinal structure aligned with manual risk categories and to characterize differences in signal across streams and temporal scales. These diagnostics do not establish clinical validity, but they provide an auditable link between modelling intent (bursts, rhythms, drift) and how information is organized in the representation, while collapse-safeguard diagnostics help verify that latents are meaningfully used ( 35 ). A major practical challenge in this domain is selective labelling. Manual analyst assessments are operationally triaged and plausibly MNAR: windows are labelled preferentially when behavioral signals and system context (e.g., backlog) make review more likely ( 13 – 18 ). To mitigate resulting selection bias, we introduced Backlog-Aware Label Inference (BALI), which models labelling propensity using embeddings and backlog covariates, applies inverse-probability weighting ( 40 ), and uses a conservative Bayes adjustment to pseudo-label a prespecified subset of high-confidence unlabeled windows for training. 
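The two generic ingredients named in the BALI description, inverse-probability weighting of labelled windows and confidence-thresholded pseudo-labelling of unlabelled ones, can be sketched as follows. This is an illustration only: the labelling propensities and class probabilities are assumed to be already estimated, the propensity floor of 0.05 is a hypothetical stabilisation choice, and the conservative Bayes adjustment and backlog hazard model used in the study are not reproduced here. Only the threshold t = 0.7 is taken from the text.

```python
def ipw_weights(propensities, floor=0.05):
    """Inverse-probability weights for labelled windows.
    `floor` (hypothetical) clips tiny propensities so no window dominates."""
    return [1.0 / max(p, floor) for p in propensities]

def pseudo_label(class_probs, threshold=0.7):
    """Assign the most probable class to an unlabelled window only when its
    probability reaches `threshold`; otherwise leave the window unlabelled."""
    labels = []
    for probs in class_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= threshold else None)
    return labels

# Labelled windows: estimated probability that each window was reviewed.
weights = ipw_weights([0.5, 0.02, 0.25])

# Unlabelled windows: estimated probabilities over (low, medium, high).
pseudo = pseudo_label([
    [0.10, 0.20, 0.70],   # confident enough: pseudo-labelled as high
    [0.40, 0.35, 0.25],   # ambiguous: left unlabelled
    [0.75, 0.20, 0.05],   # confident enough: pseudo-labelled as low
])
print(weights, pseudo)
```

Windows that were unlikely to be reviewed receive larger weights, which counteracts the over-representation of readily triaged windows among the manual labels.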
The approach was inspired by backlog and delay effects studied in actuarial claims-processing settings ( 24 ) but is adapted here to the selective-label setting of capacity-constrained human review workflows. Critically, pseudo-labelled windows were confined to the training split; all reported validation, test, and temporal holdout results relied exclusively on original manual assessments, preserving a strict separation between training augmentation and evaluation. Hidden Markov models provide a standard statistical framework for latent regime inference in time series under observation noise ( 25 , 47 ). Dynamic regime modelling with a Gaussian hidden Markov model provides a natural bridge between machine learning outputs and systems-level interpretation ( 47 ). Regimes yield a compact state description and transition structure that can be logged, audited, and analyzed over time ( 48 ). In our setting, the GHMM is not a smoothing heuristic applied to a risk score; it encodes the modelling assumption that risk evolves as a persistent latent regime whose switching dynamics should be estimated explicitly and can be audited ( 42 ). Because transition estimation was regularized to favor persistence, persistence statistics should be interpreted as model-implied summaries under an explicit persistence prior; operationally, we leverage regime persistence through “persistence gates” that provide an explicit operational control for trading alert stability against detection and lead time (Appendix F; Tables S5–S6). This is particularly important in heavy-tailed behavioral domains, where isolated extreme days may be common but not necessarily indicative of sustained risk ( 49 ). Regime posteriors also support calibrated probabilities that can be mapped to review budgets and policy thresholds, making governance choices explicit rather than implicit ( 2 ). 
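Two quantities discussed above lend themselves to a short illustration. For a first-order Markov chain, the time spent in a state with self-transition probability a is geometrically distributed with mean 1 / (1 − a), which is why self-transition probabilities near 1.0 imply very long dwell times. The persistence gate shown here is a hypothetical rule (alert only after P(high) has stayed at or above a threshold for a minimum number of consecutive days); the gate definition used in the study is given in its Appendix F and may differ.

```python
def expected_dwell(self_transition):
    """Mean dwell time (in steps) of a Markov state with the given
    self-transition probability: geometric mean 1 / (1 - a)."""
    return 1.0 / (1.0 - self_transition)

def persistence_gate(p_high, threshold, min_run):
    """Hypothetical gate: fire on day t only if P(high) >= threshold on each
    of the last `min_run` days, suppressing alerts on isolated spikes."""
    run, fired = 0, []
    for p in p_high:
        run = run + 1 if p >= threshold else 0
        fired.append(run >= min_run)
    return fired

print(expected_dwell(0.5), expected_dwell(0.75))

# Toy daily P(high) sequence (hypothetical): one isolated spike, then a run.
daily_p_high = [0.2, 0.9, 0.3, 0.8, 0.85, 0.9, 0.4]
gate = persistence_gate(daily_p_high, threshold=0.7, min_run=3)
print(gate)
```

Raising `min_run` makes alerts more stable but delays them by at least `min_run − 1` days, which is the stability versus lead-time trade-off that persistence gates are meant to expose.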
Although online gambling is the motivating application, the methodological problem structure is broader: heavy-tailed digital traces, temporally evolving latent states, selectively observed human labels, and capacity-constrained decision workflows occur across many techno-socio-economic systems, including fraud and abuse monitoring, insurance claims triage, credit and financial risk, and platform integrity ( 50 ). The framework’s emphasis on leakage control, MNAR-aware supervision, dynamic state modelling, and capacity-aware evaluation is intended to be transferable to these settings. Several limitations should be considered. First, the study uses data from a single operator; external validation across jurisdictions, products, and regulatory contexts is needed to assess generalizability. Second, manual analyst assessments are an operational reference standard rather than clinical diagnoses; they may be sparse, noisy, and potentially influenced by existing RG systems, complicating interpretation of “ground truth” and introducing feedback-loop risks ( 13 – 15 ). Third, early-warning analyses are offline simulations anchored to the timing of analyst assessments, which may lag underlying harm; results therefore quantify potential operational lead time rather than causal impact on harms. These results should therefore be interpreted as system-level backtests of monitoring policies under fixed data and label processes, not as estimates of intervention effectiveness. Fourth, calibration varied by stream; while temperature scaling reduced expected calibration error, any operational use would require ongoing recalibration and drift monitoring, particularly in label-scarce settings. 
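For readers who want to reproduce the kind of recalibration check referred to in the fourth limitation, the following is a minimal sketch of temperature scaling and of expected calibration error binned on P(high). It assumes class logits are available and fits the temperature by a simple grid search over negative log-likelihood on validation data; the grid, function names, and toy inputs are assumptions, and the study's own fitting procedure may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fit_temperature(logit_rows, labels, grid):
    """Pick the temperature in `grid` minimising validation negative log-likelihood."""
    def nll(t):
        return -sum(math.log(softmax(row, t)[y]) for row, y in zip(logit_rows, labels))
    return min(grid, key=nll)

def ece_binary(p_high, is_high, n_bins=10):
    """Expected calibration error with equal-width bins on P(high)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_high, is_high):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n, ece = len(p_high), 0.0
    for members in bins:
        if members:
            mean_conf = sum(p for p, _ in members) / len(members)
            frequency = sum(y for _, y in members) / len(members)
            ece += len(members) / n * abs(mean_conf - frequency)
    return ece

# Toy validation set (hypothetical): the model is overconfident in class 0.
val_logits = [[4.0, 0.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
temperature = fit_temperature(val_logits, val_labels, grid=[0.5, 1.0, 2.0, 4.0, 8.0])

# Toy ECE check: predictions of 0.9 and 0.1 that are always right are underconfident by 0.1.
ece = ece_binary([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
print(temperature, round(ece, 3))
```

The fitted temperature is then held fixed and applied to held-out logits, mirroring the fit-on-validation, apply-unchanged protocol described in the Methods.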
Future work should prioritize multi-operator external validation; prospective evaluation of intervention policies derived from regime dynamics; linkage of regime-based trajectories to external outcomes (e.g., treatment contact, linked surveys, or other harm proxies); and formal audits for fairness, drift, and unintended feedback effects. Methodological extensions include multi-stream fusion and joint modelling of multiple harm endpoints while preserving the leakage controls and capacity-aware evaluation used here.

In summary, heavy-tail-aware behavioral representations combined with dynamic regime modelling can support a transparent operational proxy definition of gambling-related risk and enable explicit evaluation of early warning and operating points under realistic review-capacity constraints.

Abbreviations

AUPRC: Area under the precision–recall curve
BA: Balanced accuracy
BALI: Backlog-aware label inference
CI: Confidence interval
CVAE: Conditional variational autoencoder
DSM-5: Diagnostic and Statistical Manual of Mental Disorders, 5th edition
ECE: Expected calibration error
F1-score: Harmonic mean of precision and recall
GHMM: Gaussian hidden Markov model
KL: Kullback–Leibler
LSTM: Long short-term memory
MNAR: Missing not at random
NNR: Number needed to review
PCA: Principal component analysis
PGSI: Problem Gambling Severity Index
POT: Peaks over threshold
PPV: Positive predictive value
PR: Participation ratio
RF: Random forest
RG: Responsible gambling
SNPS: Standardized normalized partial sums
TCN: Temporal convolutional network
VAE: Variational autoencoder

Declarations

Availability of data and materials

Raw behavioral event data underpinning this study are subject to contractual and privacy restrictions with the participating operator and cannot be shared publicly. Access to a de-identified analysis extract may be considered for qualified researchers on reasonable request, subject to operator approval and appropriate data protection and confidentiality agreements.
The code used to generate features, enforce leakage audits, train models, and (given access to the restricted analysis extract) reproduce all tables and figures is available at the project repository (52). An archived snapshot corresponding to this manuscript is available on Zenodo (53) (release tag: v0.1.1) under the GNU General Public License v3.0 (GPL-3.0-or-later). The public repository includes configuration files, run-order scripts, and a synthetic demo dataset sufficient to run a smoke-test of the downstream pipeline (BALI→GHMM) without exposing personal data.

Availability and requirements (software)

Project name: Definition Study (EPJ Data Science) — Reproducibility Repository (definition-study-epjds)
Project home page: https://github.com/SamAndersson-C/definition-study-epjds
Archived version: Zenodo release DOI: https://doi.org/10.5281/zenodo.18653580 (release tag: v0.1.1)
Operating system(s): Platform independent
Programming language: Python (≥3.9)
Other requirements: See requirements-demo.txt (synthetic demo) and requirements.in (reference internal environment). A standard Python virtual environment is recommended.
License: GNU General Public License v3.0 (GPL-3.0-or-later)
Any restrictions to use by non-academics: None beyond the GPL-3.0-or-later license terms. Restricted operator data are not included in this repository.

Ethics

The study procedures were carried out in accordance with the Declaration of Helsinki. The study was reviewed and approved by the Swedish Ethical Review Authority (Dnr 2023-07288-02). Informed consent was waived by the review board to permit research on pre-existing registry data.

Consent for publication

Not applicable.

Competing interests

This study was conducted as part of an industry–academia collaboration on Responsible Gambling financed by the LeoVegas Group, a licensed gambling operator in Sweden. The research was planned, conducted, and submitted under full academic freedom, as guaranteed by a written agreement.
The funder had no role in the study design or conduct, data analysis or interpretation, or the decision to publish. SA’s doctoral position is financed by the LeoVegas Group; SA is employed by Karolinska Institutet and reports no other competing interests. PL and PC report past and ongoing industry–academia collaborations with multiple gambling providers, including project-specific research funding, and report no personal financial ties to the gambling industry. OM has received funding from the Independent Research Council of Svenska Spel for clinical studies unrelated to the present study. All other authors declare no competing interests. Funding This study was funded by the LeoVegas Group, a licensed gambling operator in Sweden. Authors' contributions SA conceptualized the study and methodology, developed and implemented the software pipeline, conducted the analyses, and wrote the manuscript. PL and PC contributed to the conceptualization by proposing the initial clinically motivated research question (to develop a proxy definition of problem gambling from behavioral data) and through scientific discussions. HW, TK, and KL contributed to study conceptualization and methodological development through detailed scientific and methodological discussions. PL secured funding. All authors (HW, TK, KL, PC, PL, and OM) reviewed the manuscript. PL and OM were responsible for project administration. Acknowledgements I would like to acknowledge the Department of Mathematical Statistics at Stockholm University and its seminar series for providing a stimulating environment and many valuable ideas that informed this work. AI Use Declaration During the preparation of this manuscript, the authors used the Claude Code API (Anthropic) and ChatGPT (OpenAI) to assist with language editing and improve clarity and readability. The authors reviewed and edited the outputs as needed and take full responsibility for the content of the publication. Authors' information Not applicable. 
Additional files

Additional file 1: Supplementary Appendix (DOCX). Extended methods, additional diagnostics, and additional tables/figures supporting the main text.

References

1. Lazer D, Pentland A, Adamic L, Aral S, Barabási AL, Brewer D et al (2009) Computational social science. Science 323(5915):721–723
2. Kleinberg J, Lakkaraju H, Leskovec J, Ludwig J, Mullainathan S (2018) Human decisions and machine predictions. Q J Econ 133(1):237–293
3. Wang X, Pleimling M (2019) Online gambling of pure chance: wager distribution, risk attitude, and anomalous diffusion. Sci Rep 9:14712
4. Barabási AL (2005) The origin of bursts and heavy tails in human dynamics. Nature 435(7039):207–211
5. Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703
6. Wardle H, Degenhardt L, Marionneau V et al (2024) The Lancet Public Health Commission on gambling. Lancet Public Health 9(11):e950–e994. https://doi.org/10.1016/S2468-2667(24)00167-1
7. Allami Y (2024) Strengthening oversight and integrity: The multi-faceted role of centralized player tracking systems in gambling. Addiction 119(7):1170–1171
8. Auer M, Griffiths MD (2023) The relationship between structural characteristics and gambling behaviour: an online gambling player tracking study. J Gambl Stud 39(1):265–279
9. Ferris J, Wynne H (2001) The Canadian Problem Gambling Index: final report. Canadian Centre on Substance Abuse, Ottawa
10. Currie SR, Hodgins DC, Casey DM (2013) Validity of the Problem Gambling Severity Index interpretive categories. J Gambl Stud 29(2):311–327
11. Miller NV, Currie SR, Hodgins DC, Casey D (2013) Validation of the Problem Gambling Severity Index using confirmatory factor analysis and Rasch modelling. Int J Methods Psychiatr Res 22(3):245–255
12. Braverman J, LaPlante DA, Nelson SE, Shaffer HJ (2013) Using cross-game behavioral markers for early identification of high-risk internet gamblers. Psychol Addict Behav 27(3):868–877
13. Ghaharian K, Abarbanel B, Phung D, Puranik P, Kraus S, Feldman A et al (2023) Applications of data science for responsible gambling: a scoping review. Int Gambl Stud 23(2):289–312
14. Magnusson K, Nilsson A, Andersson G, Hellner C, Carlbring P (2019) Level of agreement between problem gamblers’ and collaterals’ reports: a Bayesian random-effects two-part model. J Gambl Stud 35(4):1127–1145
15. Murch WS, Kairouz S, French M (2024) Establishing the temporal stability of machine learning models that detect online gambling-related harms. Comput Hum Behav Rep 14:100427
16. Lakkaraju H, Kleinberg J, Leskovec J, Ludwig J, Mullainathan S (2017) The selective labels problem: evaluating algorithmic predictions in the presence of unobservables. In: Proc 23rd ACM SIGKDD Int Conf Knowl Discov Data Min
17. De-Arteaga M, Fogliato R, Chouldechova A, G’Sell M (2018) Learning under selective labels in the presence of expert consistency. arXiv:1807.00905
18. Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
19. Ahmad Z, Mahmoudi E, Hamedani GG, Kharazmi O (2020) New methods to define heavy-tailed distributions with applications to insurance data. J Taibah Univ Sci 14(1):359–382
20. Kaufman S, Rosset S, Perlich C, Stitelman O (2012) Leakage in data mining: formulation, detection, and avoidance. ACM Trans Knowl Discov Data 6(4):15
21. Kingma DP, Welling M (2013) Auto-Encoding Variational Bayes. arXiv:1312.6114
22. Sohn K, Lee H, Yan X (2015) Learning structured output representation using deep conditional generative models. Adv Neural Inf Process Syst 28:3483–3491
23. Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv:1503.02531
24. Lindskog F, Wüthrich MV (2025) Eliciting claims development patterns and costs hidden in backlogs. Eur Actuar J 15(3):667–705
25. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286
26. Guo C, Pleiss G, Sun Y, Weinberger KQ (2017) On calibration of modern neural networks. In: Proc 34th Int Conf Mach Learn (ICML). PMLR 70:1321–1330
27. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins JM (2018) Double/debiased machine learning for treatment and structural parameters. Econometrics J 21(1):C1–C68
28. Coles S (2001) An introduction to statistical modeling of extreme values. Springer, London
29. Page ES (1954) Continuous inspection schemes. Biometrika 41(1–2):100–115
30. Coles SG, Heffernan J, Tawn JA (1999) Dependence measures for extreme value analyses. Extremes 2(4):339–365
31. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
32. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
33. Bai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271
34. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. arXiv:1706.03762
35. He J, Spokoyny D, Neubig G, Berg-Kirkpatrick T (2019) Lagging inference networks and posterior collapse in variational autoencoders. In: Proc Int Conf Learn Represent (ICLR)
36. Bowman SR, Vilnis L, Vinyals O, Dai AM, Jozefowicz R, Bengio S (2016) Generating sentences from a continuous space. In: Proc 20th SIGNLL Conf Comput Nat Lang Learn (CoNLL). pp 10–21
37. Kingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, Welling M (2016) Improving variational inference with inverse autoregressive flow. arXiv:1606.04934
38. Miyato T, Kataoka T, Koyama M, Yoshida Y (2018) Spectral normalization for generative adversarial networks. arXiv:1802.05957
39. Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34(2):187–220
40. Seaman SR, White IR (2013) Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 22(3):278–295
41. Liu H, Lafferty J, Wasserman L (2009) The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J Mach Learn Res 10:2295–2328
42. Fox EB, Sudderth EB, Jordan MI, Willsky AS (2011) A sticky HDP-HMM with application to speaker diarization. Ann Appl Stat 5(2A):1020–1056
43. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall, London
44. Saito T, Rehmsmeier M (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3):e0118432
45. Catania M, Griffiths MD (2021) Applying the DSM-5 criteria for gambling disorder to online gambling account-based tracking data: an empirical study utilizing cluster analysis. J Gambl Stud 38(4):1289–1306
46. Murch WS, Kairouz S, Dauphinais S, Picard E, Costes J, French M (2023) Using machine learning to retrospectively predict self-reported gambling problems in Quebec. Addiction 118(8):1569–1578
47. Cappé O, Moulines E, Rydén T (2005) Inference in hidden Markov models. Springer, New York
48. Sculley D, Holt G, Golovin D et al (2015) Hidden technical debt in machine learning systems. Adv Neural Inf Process Syst 28:2503–2511
49. Hill BM (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3(5):1163–1174
50. Bolton RJ, Hand DJ (2002) Statistical fraud detection: a review. Stat Sci 17(3):235–255
51. Ke G, Meng Q, Finley T et al (2017) LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 30:3146–3154
52. Andersson S (2026) Definition Study (EPJ Data Science) — Reproducibility Repository (definition-study-epjds). GitHub repository. Available at: https://github.com/SamAndersson-C/definition-study-epjds (Accessed 16 Feb 2026)
53. Andersson S (2026) Definition Study (EPJ Data Science) — Reproducibility Repository (definition-study-epjds) (v0.1.1). Zenodo. https://doi.org/10.5281/zenodo.18653580 (Accessed 16 Feb 2026)

Additional Declarations

No competing interests reported.
Supplementary Files DefinitionStudySupplementaryAppendixEPJDS20260216FINALSUBMIT.docx Status: Under Review Version 1 posted Reviews received at journal 03 Apr, 2026 Reviewers agreed at journal 01 Apr, 2026 Reviewers invited by journal 16 Mar, 2026 Editor assigned by journal 16 Feb, 2026 Submission checks completed at journal 16 Feb, 2026 First submitted to journal 16 Feb, 2026 © Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-8889984","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":604360732,"identity":"0d237128-d7f4-4412-b4c3-20fb59b060bd","order_by":0,"name":"Sam 
Andersson","email":"","orcid":"","institution":"Karolinska Institutet","correspondingAuthor":true,"prefix":"","firstName":"Sam","middleName":"","lastName":"Andersson","suffix":""},{"id":604360734,"identity":"f5627fad-578a-49e9-9ec1-c5b8ee312404","order_by":1,"name":"Helga Westerlind","email":"","orcid":"","institution":"Karolinska Institutet","correspondingAuthor":false,"prefix":"","firstName":"Helga","middleName":"","lastName":"Westerlind","suffix":""},{"id":604360736,"identity":"5000ecc5-8eb0-47db-be0d-3b127e413697","order_by":2,"name":"Timo Koski","email":"","orcid":"","institution":"KTH Royal Institute of Technology","correspondingAuthor":false,"prefix":"","firstName":"Timo","middleName":"","lastName":"Koski","suffix":""},{"id":604360737,"identity":"4a089510-ab97-4b6a-85f7-758c7863f1c3","order_by":3,"name":"Keenan Lyon","email":"","orcid":"","institution":"Swedbank AB, Economic Crime Prevention","correspondingAuthor":false,"prefix":"","firstName":"Keenan","middleName":"","lastName":"Lyon","suffix":""},{"id":604360738,"identity":"7ac9970e-db34-497a-b8dd-2ff791e62867","order_by":4,"name":"Per Carlbring","email":"","orcid":"","institution":"Stockholm University","correspondingAuthor":false,"prefix":"","firstName":"Per","middleName":"","lastName":"Carlbring","suffix":""},{"id":604360739,"identity":"ebac9dc8-a371-4ea7-b441-28e9365678e1","order_by":5,"name":"Philip Lindner","email":"","orcid":"","institution":"Karolinska Institutet \u0026 Stockholm Health Care 
Services","correspondingAuthor":false,"prefix":"","firstName":"Philip","middleName":"","lastName":"Lindner","suffix":""},{"id":604360740,"identity":"40febd18-f249-463d-896a-e0390adb1406","order_by":6,"name":"Olof Molander","email":"","orcid":"","institution":"Karolinska Institutet \u0026 Stockholm Health Care Services","correspondingAuthor":false,"prefix":"","firstName":"Olof","middleName":"","lastName":"Molander","suffix":""}],"badges":[],"createdAt":"2026-02-16 05:53:38","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8889984/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8889984/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":104481800,"identity":"6bb15720-a0d2-4be7-ad0d-c551dcfc2d2f","added_by":"auto","created_at":"2026-03-12 09:33:59","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":252889,"visible":true,"origin":"","legend":"\u003cp\u003eStudy design and analysis pipeline.\u003c/p\u003e\n\u003cp\u003eOverview of the study pipeline, including windowing, leakage controls, model training with BALI pseudo-labels, and evaluation on labelled test and temporal holdout splits.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8889984/v1/0d37cc1806ad89fc2446e042.png"},{"id":104481799,"identity":"228a76c5-0751-4c04-8e1f-6480e91af8bd","added_by":"auto","created_at":"2026-03-12 09:33:59","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":209289,"visible":true,"origin":"","legend":"\u003cp\u003eExceedance features and their association with manual risk categories.\u003c/p\u003e\n\u003cp\u003ePanel A illustrates exceedance representation: observations above a threshold (τ) define exceedance indicators and excess magnitudes. 
Panels B–E show mean exceedance-derived feature value by manual risk category for the top validation-selected feature in each stream (transactions, sessions, payments, bets); ρ denotes Spearman correlation on the validation split.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8889984/v1/288a59c759efdd5051faf5b9.png"},{"id":104780984,"identity":"02b30c78-d2d2-4c58-963f-1ddaa9964d1c","added_by":"auto","created_at":"2026-03-17 07:54:23","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":108587,"visible":true,"origin":"","legend":"\u003cp\u003eDiscrimination performance across streams and baselines.\u003c/p\u003e\n\u003cp\u003ePanel A reports balanced accuracy (three-class mapping) for the deep model with and without BALI, compared with baseline models, on the labelled test set. Panel B compares ladder and ablation baselines.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8889984/v1/1f66fcd93b683a176798373d.png"},{"id":104481798,"identity":"eeec71b6-a628-4f67-93aa-7c564ccea733","added_by":"auto","created_at":"2026-03-12 09:33:59","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":138420,"visible":true,"origin":"","legend":"\u003cp\u003eOperational evaluation of early-warning and capacity-constrained alerting.\u003c/p\u003e\n\u003cp\u003eLead time and workload-yield trade-off under a top-10-per-week queueing policy. Panel A shows lead-time distributions and detection rates; Panel B summarizes 30-day horizon yield (PPV) and escalation-level recall, including alerts per escalation caught. 
Alternative fixed-threshold and persistence-gate policies are reported in Tables S5–S6.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8889984/v1/f35253b46231c0e57ed98f91.png"},{"id":104481802,"identity":"c2f5c618-6529-4f83-81aa-41648554c05f","added_by":"auto","created_at":"2026-03-12 09:33:59","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":159366,"visible":true,"origin":"","legend":"\u003cp\u003eCalibration and uncertainty. Panel A shows reliability diagrams for representative streams; Panel B shows bootstrap uncertainty intervals for balanced accuracy on the combined labelled evaluation set (test+holdout).\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8889984/v1/b9b61d42022f1cd2b449135e.png"},{"id":104785952,"identity":"f71a3d27-d7ee-45bc-a486-186874349797","added_by":"auto","created_at":"2026-03-17 08:13:52","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2284565,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8889984/v1/127aec5f-edce-4272-a977-065c60300c13.pdf"},{"id":104781100,"identity":"415b8259-c2c5-4810-9926-62560b4eb1aa","added_by":"auto","created_at":"2026-03-17 07:54:46","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":6535988,"visible":true,"origin":"","legend":"","description":"","filename":"DefinitionStudySupplementaryAppendixEPJDS20260216FINALSUBMIT.docx","url":"https://assets-eu.researchsquare.com/files/rs-8889984/v1/b4f35dca4d7940e5c38dfa75.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Heavy-tail-aware representation learning and dynamic Bayesian state modelling to derive an operational proxy definition of problem gambling risk 
from routine online gambling data","fulltext":[{"header":"Introduction","content":"\u003cp\u003eDigital platforms generate dense \u0026ldquo;digital traces\u0026rdquo; of human behavior that can be analyzed as first-order objects for understanding and monitoring socio-technical systems (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e). In many operational settings, the core scientific and engineering challenge is not simply prediction, but definition: how to infer a latent, time-varying risk state from heavy-tailed, temporally dependent behavioral data when expert labels are sparse and selectively observed, and when downstream actions are constrained by finite review capacity (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). Online gambling provides a clear instance of this broader problem. Behavioral logs capture high-frequency patterns of engagement and spending, but activity is dominated by rare extreme days rather than typical averages, and meaningful signals may appear as bursts, clustering, or persistent shifts rather than smooth trends (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eGambling-related harm is associated with substantial impacts on health, well-being, and socioeconomic outcomes, and a persistent operational challenge is identifying elevated risk early enough to prevent escalation (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). 
Many responsible gambling (RG) programs rely on self-report screening tools, voluntary limit-setting, and rule-based triggers such as deposits or losses above fixed thresholds (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). These approaches can be delayed (flagging risk after large losses), brittle to behavioral adaptation, and poorly suited to heavy-tailed behavioral distributions where a small number of days account for a large fraction of wagering or losses (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). For instance, a customer may exceed a monthly deposit threshold only after incurring substantial losses over many extreme-wagering days, by which time harm may already have occurred (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe rapid digitalization of gambling products has increased \u0026ldquo;always-on\u0026rdquo; access and the volume of routine telemetry available for monitoring, strengthening calls for scalable evaluation and oversight infrastructure built on routinely collected data (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eSurvey-based definitions and psychometric instruments remain essential for screening and prevalence estimation, but they were not designed for near-real-time operational monitoring on platform data (\u003cspan additionalcitationids=\"CR10\" citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e). 
This motivates complementary operational definitions that treat risk as time-varying, can be evaluated prospectively against outcomes and service responses, and explicitly incorporate resource constraints that determine what monitoring systems can realistically do (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). In this context, an \u0026ldquo;operational proxy definition\u0026rdquo; is a model-derived categorization intended to support consistent triage and evaluation, not to assert a clinical diagnosis.\u003c/p\u003e \u003cp\u003eBehavioral and machine-learning approaches using account-based tracking data are increasingly common (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). However, online gambling telemetry is strongly non-Gaussian and temporally structured, and supervisory signals are often sparse and operationally triaged (e.g., manual reviews), creating a missing-not-at-random (MNAR) setting in which the probability of receiving a label depends on behavioral signals and system-level constraints (\u003cspan additionalcitationids=\"CR14 CR15 CR16 CR17\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). In addition, operational feedback loops can arise when existing RG systems influence which customers are reviewed and labelled (\u003cspan additionalcitationids=\"CR14\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). 
As a result, methods must preserve informative extremes, enforce causal ordering to prevent label leakage, quantify temporal portability, and represent risk as a dynamic process rather than a static score (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). Yet, to our knowledge, no published framework has integrated all of these elements into a single auditable pipeline.\u003c/p\u003e \u003cp\u003eIn this study, we derive an operational proxy definition of gambling-related risk from routine platform telemetry and triaged analyst assessments, with the explicit goal of enabling auditable early warning and workload planning rather than clinical diagnosis (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). We propose an end-to-end framework that (i) represents heavy-tailed behavior using interpretable exceedance events (whether extreme activity occurred) and exceedance magnitudes (how extreme it was), summarized over short horizons; (ii) learns leakage-safe, label-aware embeddings of 30-day behavioral windows using a teacher\u0026ndash;student conditional variational autoencoder (CVAE) (\u003cspan additionalcitationids=\"CR22\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e); (iii) corrects for MNAR manual labelling using a backlog-aware label inference step (\u003cspan additionalcitationids=\"CR17\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e); and (iv) infers time-varying behavioral regimes using a Gaussian hidden Markov model (GHMM) to produce calibrated 
three-class probabilities for operational use (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eOur objectives were to (1) derive a discrete, interpretable regime-based operational proxy definition of gambling-related risk; (2) validate alignment to manual analyst assessments under strict leakage controls; (3) quantify generalization across customers and over time using temporal holdout evaluation; and (4) evaluate early-warning performance and operating points under explicit capacity constraints that reflect realistic review budgets (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy design and data sources\u003c/h2\u003e \u003cp\u003eWe conducted an observational analysis using de-identified daily behavioral records from customers of a licensed online gambling operator. Four behavioral streams were analyzed separately: transactions (daily financial movements), bets (stakes and wins/losses), sessions (engagement timing and duration), and payments (deposit and withdrawal events). 
Two supervisory sources were available: manual analyst assessments of gambling risk (five ordinal categories, used as the reference standard for evaluation) and daily responsible-gaming predictions (used as proxy labels for representation learning and feature selection).\u003c/p\u003e \u003cp\u003eThese data characteristics (high-frequency behavioral logs, sparse expert labels, and heavy-tailed distributions) motivated a staged analysis pipeline designed to separate data preparation, representation learning, label correction, and dynamic modelling.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003ePipeline overview and order of operations\u003c/h3\u003e\n\u003cp\u003eThe analysis was implemented as a deterministic, leakage-controlled pipeline with persisted artefacts between stages. Processing followed a fixed sequence designed to enforce causal ordering and reproducibility: (1) preprocessing with hybrid window construction and explicit auditing for label leakage; (2) feature engineering tailored to heavy-tailed behavioral data, with parameters and transformations frozen across splits and window types; (3) feature selection restricted to normalized and derived features to reduce redundancy while preserving short-horizon temporal signal; (4) representation learning using a hierarchical conditional variational autoencoder (CVAE) trained in a teacher\u0026ndash;student regime with risk-specific priors; (5) backlog-aware label inference (BALI) to augment training data under missing-not-at-random (MNAR) manual assessment; and (6) dynamic 
regime modelling using a Gaussian hidden Markov model (GHMM) to derive a three-level operational proxy definition of risk.\u003c/p\u003e \u003cp\u003eDesign rationale: The pipeline is deliberately multi-scale and tail-focused. Gambling telemetry tends to be heavy-tailed and bursty, and dependencies can cascade across streams and time scales (e.g., long sessions co-occurring with high stakes and followed by payment events) (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). Our guiding principle was to align each modelling component with this structure: short-horizon exceedance features preserve informative extremes; the hierarchical CVAE compresses short-, mid-, and long-range dynamics into separate latent blocks; and the GHMM then treats these embeddings as noisy observations of a slowly evolving latent regime.\u003c/p\u003e \u003cp\u003eAll transformations, thresholds, and tuning decisions were estimated using the training split only (or training folds under cross-fitting) and then applied unchanged to validation, test, and temporal hold-out sets (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e). Early stages of the pipeline focus on constructing leakage-safe analysis units and stable feature representations before any label-dependent modelling is performed, ensuring that downstream performance reflects generalization rather than artefacts of temporal or supervisory leakage.\u003c/p\u003e\n\u003ch3\u003eWindowing, splits, and leakage controls\u003c/h3\u003e\n\u003cp\u003eBehavioral time series were converted into rolling 30-day analysis windows with a one-day stride using two complementary constructions. 
Activity-based windows comprised 30 sequential daily records and were split at prolonged inactivity gaps (runs of \u0026ge;\u0026thinsp;G consecutive days with no recorded activity; G\u0026thinsp;=\u0026thinsp;6), maximizing coverage for unsupervised and proxy-supervised representation learning while preserving within-window temporal structure. Calendar-based windows comprised 30 consecutive calendar days (allowing limited inactivity) and were aligned to manual analyst assessment dates to support supervised evaluation on a fixed temporal scale. We used 30-day windows (and, where relevant, multiples thereof) to match operational reporting and to capture approximately monthly periodicities in gambling behavior (e.g., pay-cycle effects). In practice, this means that for each manual assessment, we defined a fixed look-back window ending before the assessment date.\u003c/p\u003e \u003cp\u003eTo minimize dependence and prevent information leakage, we applied a hybrid splitting strategy. Customers were assigned exclusively to training, validation, or test sets, ensuring cross-customer independence. Within training customers, the most recent 20% of each individual timeline was further reserved as a temporal hold-out to evaluate short-term generalization under realistic forward-in-time (\u0026ldquo;future-risk\u0026rdquo;) conditions (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eA strict label-leakage audit was then applied across all splits and window types. Any window for which a manual assessment date fell within the corresponding 30-day interval (start date\u0026thinsp;\u0026le;\u0026thinsp;label date\u0026thinsp;\u0026le;\u0026thinsp;end date) was excluded, treating labels as occurring after the predictive window. 
This audit enforces causal ordering by ensuring that no window contains information contemporaneous with, or subsequent to, its supervisory label (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). All subsequent feature construction, representation learning, and regime modelling were performed exclusively on these leakage-controlled windows.\u003c/p\u003e \u003cp\u003eStudy design, data splitting, leakage audit rules, and the complete analysis pipeline are summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\n\u003ch3\u003ePreprocessing and enhanced heavy-tail feature engineering\u003c/h3\u003e\n\u003cp\u003ePreprocessing was designed to correct implausible data artefacts while preserving the genuine heavy-tailed behavior characteristic of gambling activity. We applied systematic audit rules to identify and correct implausible records (e.g., sign errors and impossible values; Appendix A), enforced consistent handling of raw and transformed variables, and retained large magnitudes without aggressive truncation to avoid distorting tail behavior.\u003c/p\u003e \u003cp\u003eFor heavy-tailed metrics, we represented extreme behavior using per-feature exceedance constructions inspired by the peaks-over-threshold (POT) approach (\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e). Per-feature thresholds derived from the training data were used to generate (i) exceedance indicators, capturing whether unusually large events occurred, and (ii) exceedance magnitudes, defined as how far observations exceeded the threshold. Excess magnitudes were log-transformed to stabilize scale while preserving relative intensity (Appendix A). 
This representation focuses explicitly on behavior beyond high thresholds and helps retain information about rare but operationally meaningful events that can be obscured by global transformations or aggressive truncation. Extremes were then summarized over multiple short horizons using block statistics and rolling aggregates, such as weekly exceedance counts and maximum excess. This representation preserves separate information about the frequency and severity of extreme events, rather than collapsing them into a single scale-normalized summary.\u003c/p\u003e \u003cp\u003eTo characterize temporal dynamics beyond raw magnitude, we additionally engineered scale-free momentum features using standardized normalized partial sums (SNPS), which help distinguish sustained patterns of activity from short-lived spikes (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e). We also included lagged tail-dependence interaction features (χ-lift) to capture short-horizon cascades across behavioral dimensions, such as clustered extreme behavior occurring close in time (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e). All preprocessing parameters\u0026mdash;including imputation values, tail thresholds, selected interaction pairs, and final feature ordering\u0026mdash;were estimated on the training split only and frozen via a consistency manager before application to validation, test, and temporal hold-out windows.\u003c/p\u003e \u003cp\u003eThis process resulted in a high-dimensional but stable feature representation, motivating a subsequent feature-selection stage to reduce redundancy while retaining the most temporally informative signals prior to representation learning.\u003c/p\u003e\n\u003ch3\u003eFeature selection\u003c/h3\u003e\n\u003cp\u003eTo reduce dimensionality and improve downstream stability, we performed feature selection after enhanced feature engineering and before representation learning. 
Selection was restricted to normalized and derived features; raw base columns were excluded by design to avoid scale artefacts and leakage from untransformed heavy-tailed magnitudes. The inactivity-gap feature (days since last activity) was always retained, as it encodes operationally relevant disengagement independent of monetary intensity.\u003c/p\u003e \u003cp\u003eFeature relevance was ranked using a multi-criterion, sequence-aware approach combining unsupervised measures of temporal structure with auxiliary responsible-gaming (RG) signals. Unsupervised criteria quantified within-window temporal variability, cross-feature redundancy (absolute correlations), and structural contribution (PCA loadings). We additionally estimated temporal predictability by fitting a next-window forecasting model (LSTM in our implementation) (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e) and computing permutation-based importance (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e). To bias selection toward operationally meaningful signals without using study outcome labels, we overlaid weak supervision from RG model outputs, including association with continuous RG scores and sensitivity around predicted RG category transitions. Final rankings were obtained by weighted aggregation of these scores while retaining a small set of predefined stability features.\u003c/p\u003e \u003cp\u003eThe final selected feature set was determined on the training split only and then fixed and applied unchanged to validation, test, and temporal hold-out data. This procedure reduced redundancy while preserving short-horizon temporal dynamics and ensured that downstream representation learning operated on a stable, leakage-controlled feature basis. 
Implementation details for each criterion are provided in the Appendix.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eRepresentation learning with a hierarchical conditional VAE and teacher\u0026ndash;student training\u003c/h2\u003e \u003cp\u003eUsing the fixed, selected feature set, we learned low-dimensional representations of each 30-day window for downstream temporal modelling. Fixed-length embeddings were obtained with a hierarchical conditional variational autoencoder (CVAE) (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e) with three parallel causal temporal convolutional network (TCN) encoders (short/mid/long receptive fields) (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e). Each encoder produced a sequence of hidden states, which was contextualized with multi-head self-attention (\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e) and summarized by masked mean pooling to yield a scale-specific representation. Latent parameters were produced hierarchically, with the mid-level latent conditioned on the short latent and the long-level latent conditioned on both short and mid; the three latent vectors were concatenated to form the final embedding.\u003c/p\u003e \u003cp\u003eTo stabilize risk semantics in latent space, the CVAE used class-specific diagonal-Gaussian priors at each temporal scale, with one learned prior per risk category (plus an explicit \u0026lsquo;unknown\u0026rsquo; prior), rather than a single standard normal prior. This encourages embeddings associated with similar risk levels to occupy coherent regions of the latent space while preserving overlap for borderline cases.\u003c/p\u003e \u003cp\u003eTraining followed a two-phase teacher\u0026ndash;student design to decouple representation learning from supervision on sparse manual labels. 
In phase 1, a teacher model was trained on activity-based windows using RG-derived proxy signals, combining masked reconstruction with auxiliary prediction heads for RG score (regression) and RG category distribution (classification). In phase 2, a student model was adapted to calendar-based windows aligned to manual analyst assessments: the TCN encoders and decoder were frozen, and only the manual classification head together with the final latent projection layers were updated, yielding a deliberate \u0026ldquo;fine-tuning\u0026rdquo; step that preserves the coarse latent geometry learned by the teacher while adapting decision boundaries to gold labels. Student training used cross-entropy with class-balanced sampling, with an additional small ordinal shaping term and (optionally) a lightweight distillation loss aligning student predictions to the teacher RG head (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eTo avoid posterior collapse, an issue common in VAE training (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e), and to promote stable latent utilization, training used KL annealing (warm-up) together with free-bits thresholding (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e), spectral normalization (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e), and a diversity regularizer on the aggregated latent covariance. Latent utilization was monitored with a participation-ratio diagnostic to detect collapse. During calendar-window adaptation and embedding extraction, the encoder was always conditioned on an explicit \u0026lsquo;unknown-label\u0026rsquo; token to prevent oracle label conditioning. 
The resulting embeddings were then treated as fixed representations and used as inputs for all subsequent label inference and temporal regime modelling. Implementation details are provided in the Appendix.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eBacklog-aware label inference (BALI) for MNAR manual labels\u003c/h3\u003e\n\u003cp\u003eWe introduce backlog-aware label inference (BALI), a conservative pseudo-labelling step applied after embedding extraction to mitigate selection bias arising from operational triage of manual analyst assessments. Manual labels were treated as missing not at random (MNAR): the probability that a window receives a manual rating depends jointly on behavioral signals, summarized by the learned embeddings, and on system-level constraints, including the prevailing annotation backlog (\u003cspan additionalcitationids=\"CR17\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eWe estimated the labelling propensity P(labelled | X, B) using a hazard model (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e) (Appendix B), where X denotes the embedding and B denotes backlog covariates (e.g., recent unlabeled volume and within-sequence position). Inspired by backlog effects studied in actuarial delay and claims-processing settings (\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e), BALI then applies a Bayes adjustment to recover an estimate of P(Y | X) from the biased labelled subset, and uses this adjusted distribution to pseudo-label a prespecified subset of high-confidence unlabeled windows for training augmentation.\u003c/p\u003e \u003cp\u003eBayes adjustment: P(Y\u0026thinsp;=\u0026thinsp;y | unlabeled, X) \u0026prop; P(unlabeled | Y\u0026thinsp;=\u0026thinsp;y) \u0026middot; P(Y\u0026thinsp;=\u0026thinsp;y | X),\u003c/p\u003e \u003cp\u003ewhere class-conditional censoring rates P(unlabeled | Y\u0026thinsp;=\u0026thinsp;y) were estimated directly from the labelling-propensity model (\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e).\u003c/p\u003e \u003cp\u003ePseudo-labels were accepted only when the maximum adjusted class probability exceeded a prespecified threshold, yielding conservative label augmentation intended to reduce false escalation. Pseudo-label quality was assessed using K-fold cross-fitting (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e) with group-aware splits to prevent information leakage across customers. All pseudo-labelled windows were confined to the training split for downstream modelling; all reported validation, test, and temporal holdout results relied exclusively on original manual assessments.\u003c/p\u003e \u003cp\u003eConceptually, BALI treats the observation process (whether a window is labelled) as part of the data-generating mechanism, rather than an ignorable missingness pattern. This is analogous to queueing/backlog settings in which processed cases are a biased subset of all cases because processing capacity is finite (\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e). 
In our setting, we do not attempt to infer latent \u0026ldquo;true\u0026rdquo; event times or full delay distributions; the goal is a conservative correction that (i) improves training-label coverage under selective labelling, (ii) keeps embeddings and temporal structure unchanged, and (iii) preserves strict separation between training augmentation and manual-only evaluation.\u003c/p\u003e\n\u003ch3\u003eDynamic Bayesian regime modelling with an improved GHMM\u003c/h3\u003e\n\u003cp\u003eTo characterize dynamic behavioral regimes and transitions over time, we fitted a Gaussian hidden Markov model (GHMM) to the learned window embeddings (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e). Prior to model fitting, embeddings were standardized and, where appropriate, dimensionally reduced to improve numerical stability, using rank-based marginal normalization (\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e) followed by whitening and optional principal component analysis. These transformations reduce the influence of heavy-tailed latent dimensions, improve covariance conditioning, and make Gaussian emission assumptions more tenable in finite samples. All transformations were estimated on the training split only and applied unchanged to validation, test, and temporal holdout data.\u003c/p\u003e \u003cp\u003eGHMM training employed regularized transition structure to favor regime persistence and discourage implausible long-range jumps (\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e), reflecting the expectation that behavioral risk evolves gradually rather than instantaneously. Emission distributions were modelled as multivariate Gaussians with constrained covariance structure to ensure stable estimation in high dimensions. 
The number of states and regularization settings were selected on the training split using validation performance and stability diagnostics.\u003c/p\u003e \u003cp\u003eTo render regimes interpretable and ordinal, GHMM states were post-ordered using an auxiliary risk scorer derived from labeled windows, producing a monotone severity ordering without altering the fitted state dynamics. For downstream use, posterior summaries of the ordered states were mapped into three action-oriented risk classes using a calibrated multinomial mapper. This mapping yields a discrete operational proxy definition while preserving the probabilistic regime structure underpinning temporal transitions.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eOperational proxy definition and outcome construction\u003c/h2\u003e \u003cp\u003eWhile the Gaussian hidden Markov model captures latent behavioral regimes and their temporal transitions, operational use requires mapping these regimes into a small number of actionable risk categories. Manual analyst assessments were originally recorded on a five-level ordinal scale. For the operational proxy definition, these ratings were collapsed into three action-oriented classes (low, medium, and high) to reflect practical intervention thresholds for monitoring triage and workload planning, not clinical diagnosis.\u003c/p\u003e \u003cp\u003eSeveral plausible five-to-three collapse schemes were prespecified. Because assessment distributions and operational review criteria differ by stream, the collapse was treated as a stream-specific calibration step: among the prespecified schemes, we selected the mapping that maximized balanced accuracy on the validation split and then fixed it and applied it unchanged to test and temporal hold-out data. 
This mapping defines the discrete proxy labels used for reporting and for escalation-event construction, without changing the learned embeddings or the fitted GHMM dynamics; all reported validation, test, and temporal holdout results rely exclusively on original manual analyst assessments.\u003c/p\u003e \u003cp\u003eThe final output of the operational definition is a three-class probability distribution for each 30-day window, obtained by combining posterior summaries of the ordered GHMM regimes with a calibrated multinomial mapping. This probabilistic formulation supports downstream thresholding and capacity-constrained decision policies while retaining uncertainty information.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eStatistical analyses target two complementary questions: (i) how the probabilistic high-risk score supports capacity-constrained monitoring policies, and (ii) how the resulting three-class mapping agrees with selectively observed manual analyst assessments under strict leakage controls (criterion validity). Primary evaluation therefore focuses on policy-level early-warning utility under fixed review budgets (probability thresholds and top-K queue policies), quantified by detection of subsequent escalation events, lead time, alert-level PPV, escalation-level recall, and number needed to review. As supporting criterion-validity evidence on held-out labelled windows, we report balanced accuracy (macro-averaged recall), macro-averaged F1-score, class-specific precision/recall, and one-vs-rest area under the precision\u0026ndash;recall curve (AUPRC) for the high-risk class. 
Argmax-derived class assignments are reported as diagnostics; operational policies act on calibrated P(high) rather than argmax class labels.\u003c/p\u003e \u003cp\u003eEarly warning and lead time: We defined escalation events as the first occurrence of a high-risk manual analyst assessment (under the stream-specific five-to-three collapse) and estimated lead time as the difference between the escalation date and the first model alert date occurring beforehand. Alerts were generated under two families of decision policies: (i) fixed probability thresholds on P(high | window), and (ii) capacity-constrained top-K queue policies that, within each review period (day or week), rank customers by P(high | window) (maximum score within the period) and issue alerts for the K highest-scoring cases, with at most one alert per customer-period. Lead-time distributions were summarized for horizons of 1, 7, 14, and 30 days.\u003c/p\u003e \u003cp\u003eOperational actionability and decision-analytic evaluation: We evaluated operating points by reporting alert rate, precision, recall for future escalation within prespecified horizons, and number needed to review under capacity constraints.\u003c/p\u003e \u003cp\u003eCalibration: We assessed calibration of the model\u0026rsquo;s probabilistic outputs to support downstream thresholding and capacity planning. Calibration was evaluated using reliability diagrams and expected calibration error (ECE), with binning performed on P(high) to match high-risk decision support. 
Temperature scaling (\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e) was used as a post-hoc recalibration method fit on the validation split only and then applied unchanged to held-out evaluation data.\u003c/p\u003e \u003cp\u003eUncertainty: We computed 95% confidence intervals using customer-level bootstrap resampling to respect within-customer dependence (\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eBaselines and ablations: We compared the GHMM-based definition with (i) an ordinal baseline risk scorer trained on labelled windows alone, (ii) conventional non-temporal classifiers trained either on engineered feature sets or on learned window embeddings, and (iii) targeted ablations removing heavy-tail feature groups, BALI augmentation, or the GHMM temporal component. To contextualize incremental validity relative to incumbent systems, we additionally report an \u0026ldquo;RG proxy only\u0026rdquo; comparator based solely on the operator\u0026rsquo;s daily RG proxy predictions.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eWe report results from two complementary perspectives: (i) operational utility under capacity-constrained review policies, and (ii) agreement with selectively observed manual analyst assessments on labelled windows (criterion validity / discrimination) under strict leakage controls. Operationally, under a top-10-per-week review queue the proposed definition detected 38.6\u0026ndash;62.3% of escalation events across streams with median lead times of 42\u0026ndash;290 days. 
We then report supporting evidence on label coverage and MNAR correction (BALI), face-validity diagnostics for heavy-tail features, agreement metrics, regime dynamics, calibration, baseline comparators, and frozen-pipeline temporal portability.\u003c/p\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eLabel coverage and backlog-aware label inference\u003c/h2\u003e \u003cp\u003eManual analyst labels were operationally triaged and therefore sparse and potentially biased toward higher-risk windows. Training label coverage varied markedly by stream (transactions 3.0%, sessions 57.5%, payments 85.8%, bets 81.2%). Using hazard-based backlog-aware label inference (BALI) with a prespecified confidence threshold (t\u0026thinsp;=\u0026thinsp;0.7), we pseudo-labelled an additional 8.0k\u0026ndash;179.6k windows per stream, increasing effective training coverage to 15.3% (transactions), 96.0% (sessions), 96.8% (payments), and 93.3% (bets) (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e; Figure S5A\u0026ndash;B; Appendix B).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTraining label coverage and BALI pseudo-label yield. Label coverage is the proportion of windows manually labelled by analysts in the training split. BALI pseudo-label yield is the number of additional windows assigned pseudo-labels above the confidence threshold (t\u0026thinsp;=\u0026thinsp;0.7), and the resulting effective labelled coverage. 
Cross-fit BA reports cross-fitted balanced accuracy of the BALI pseudo-labeller against held-out manual labels; BA method indicates the pseudo-labeller used.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN Train\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eN Labeled (orig)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCoverage (orig)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eN Pseudo (t\u0026thinsp;=\u0026thinsp;0.7)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eCoverage (t\u0026thinsp;=\u0026thinsp;0.7)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eCross-fit BA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eBA Method\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e438895\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e13004\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.030\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e54171\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.153\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.472\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebali_probs_direct\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e465817\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e267764\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.575\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e179606\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.960\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.368\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebali_probs_direct\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e181104\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e155447\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.858\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e19898\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.968\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.223\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebali_probs_direct\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e66100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e53703\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.812\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e8001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.933\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.461\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebali_probs_direct\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eExceedance diagnostics and face validity of heavy-tail features\u003c/h2\u003e \u003cp\u003eAcross streams, derived exceedance features generally showed monotone associations with analyst-assigned risk categories on the validation split, supporting face validity. 
Exceedance counts and excess magnitudes over short horizons were positively correlated with risk severity, with stream-specific patterns reflecting different operational signatures (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB\u0026ndash;E). The strongest gradients were stream-specific: in payments, exceedances related to disrupted payment behavior (e.g., cancelled or reversed payment events and their excess magnitudes) increased sharply with risk; in sessions, extreme session-duration patterns were most informative; in bets, spikes in turnover and high-stakes activity dominated (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB\u0026ndash;E). In temporal holdout, many (but not all) of these univariate correlations replicated in sign and magnitude, consistent with some drift in single-feature associations over time (Figure S3; Table S10). Figure S4 provides ladder plots stratified by manual risk category (Appendix A7).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eAgreement with manual analyst assessments (criterion validity)\u003c/h2\u003e \u003cp\u003eAgreement with manual analyst assessments varied by stream, with strongest criterion validity in payments and bets. On labelled test windows (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e), balanced accuracy (macro-averaged recall across three risk classes) was 0.624 (payments), 0.601 (bets), 0.515 (sessions), and 0.380 (transactions) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). For reference, chance-level balanced accuracy in a three-class setting is 1/3\u0026thinsp;\u0026asymp;\u0026thinsp;0.333, so all streams demonstrated better-than-random agreement, although agreement in transactions was only modestly above chance. 
Balanced accuracy can be interpreted as the average per-class recall (e.g., 0.624 indicates 62.4% mean sensitivity across classes) and is reported here as criterion-validity evidence under class imbalance. Because manual assessments are selectively observed, these metrics quantify agreement with observed labels rather than clinical ground truth. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA reports results with and without BALI; we include both to illustrate sensitivity of agreement to selective labelling under the same leakage-audited protocol (Appendix B).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eCriterion validity (agreement) on labelled test windows. Metrics are accuracy, balanced accuracy, macro-averaged F1, and high-risk recall for the three-class mapping after GHMM smoothing. 
Chance-level balanced accuracy is 0.33.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBalanced Accuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMacro F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eHigh-Risk Recall\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e2780\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.356\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.380\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.318\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e \u003cp\u003e0.160\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e4340\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.261\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.515\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.289\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.757\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1922\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.256\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.624\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.198\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.025\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e753\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.600\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.601\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.543\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e 
\u003cp\u003e0.641\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eClass-specific recall patterns\u003c/h2\u003e \u003cp\u003eClass-wise recall patterns differed by stream (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Under argmax assignment, payments showed high recall for the low and medium categories (0.846 and 1.000) but very low recall for the high-risk category (0.025), indicating that the argmax three-class summary is conservative for high risk in this stream. Bets showed moderate high-risk recall (0.641) and high recall for the medium category (0.935), with lower recall for low risk (0.229). Sessions showed relatively high recall for high risk (0.757) but low recall for low risk (0.175). Transactions remained challenging, with high-risk recall 0.160 and medium-category recall 0.251. These argmax metrics are reported as diagnostics; operational queue policies act on P(high) rather than argmax class assignments, and can still yield useful early warning even when argmax high-risk recall is low (e.g., payments detection 47.9% under a top-10-per-week queue).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eClass-wise recall and support on labelled test windows. 
Values are recall within each true class under argmax assignment for the three-class mapping.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLow Recall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLow Support\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMedium Recall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMedium Support\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eHigh Recall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eHigh Support\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.728\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e698\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.251\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1639\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.160\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e443\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.175\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3571\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.614\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e510\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.757\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e259\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.846\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e514\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e22\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.025\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1386\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.229\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e140\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.935\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e92\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.641\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e521\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eRegime persistence and dynamics\u003c/h2\u003e \u003cp\u003eGHMM state sequences exhibited pronounced persistence in their most stable states, with long implied dwell times, consistent with behavioural regimes that change slowly over time (Figure S6; Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Mean self-transition probability ranged from 0.363 (payments) to 0.442 (bets); across streams, minimum self-transition probabilities were 0.067\u0026ndash;0.111 and maximum self-transition probabilities were 0.994\u0026ndash;0.999, so the near-1.0 values apply to the most persistent states rather than to every state. Mean regime duration ranged from 19.9 steps (payments) to 165.6 steps (transactions); one step corresponds to one day (a one-day shift of the rolling window) (Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). 
Because GHMM transition estimation was regularized to favor persistence (\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e), these values should be interpreted as model-implied summaries under an explicit persistence prior. In deployment, persistence gates provide an explicit control for trading alert stability against detection (Appendix F; Tables S5\u0026ndash;S6).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eGHMM regime persistence. Summary statistics of the fitted GHMM transition matrices by stream, including mean self-transition probability, its range, and mean regime duration (in steps; one step corresponds to one day / one rolling-window shift).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN States\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003eMean Self-Trans\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMin Self-Trans\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMax Self-Trans\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eMean Duration\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eStationary Entropy\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.383\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.067\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.999\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e165.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.255\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.403\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.080\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.999\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e126.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e 
\u003cp\u003e2.039\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.363\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.096\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.994\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e19.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.159\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.442\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.111\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.996\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e31.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.228\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eOperational evaluation: early warning and capacity-constrained alerting\u003c/h2\u003e \u003cp\u003eOperationally, the proposed definition provided actionable early warning under capacity constraints. 
Top-K queueing policies (e.g., selecting the 10 highest-risk customers per week based on P(high)) traded precision against recall across horizons (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e; Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA\u0026ndash;B; Table S6). Under this top-10-per-week queue, escalation detection ranged from 38.6% (sessions) to 62.3% (bets), with median lead times ranging from 42 days (sessions) to 290 days (transactions) (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). Persistence gates can be used as an operational control to trade alert stability and workload against detection and lead time (Appendix F; Tables S5\u0026ndash;S6). These operating characteristics held across streams, with the highest 30-day queue yield (PPV and recall) observed for payments and bets (Table 6).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eLead time under a top-10-per-week review queue. For each stream, detection rate is the fraction of escalation events preceded by at least one alert; lead time is the number of days between the first alert and escalation among detected events. The % \u0026ge;7 days and % \u0026ge;30 days columns report the proportion of all escalation events (N Escalations) with at least 7 or 30 days of warning. 
A fuller breakdown across lead-time thresholds is reported in Table S5.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN Escalations\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDetection Rate (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMedian Lead (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e% \u0026ge;7 days (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003e% \u0026ge;30 days\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e55.3% [44.1%, 66.1%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e290 [142, 690]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e54.1% 
[43.0%, 65.0%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e45.9%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e38.6% [26.0%, 52.4%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e42 [20, 57]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e33.3% [21.4%, 47.1%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e24.6%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e165\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e47.9% [40.1%, 55.8%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e84 [50, 124]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e43.0% [35.4%, 51.0%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e32.1%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e62.3% [47.9%, 75.2%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e212 [68, 428]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e \u003cp\u003e56.6% [42.3%, 70.2%]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e49.1%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cp\u003eTable 6: Queue yield under a top-10-per-week policy (30-day horizon). We report the number of alerts generated, alert-level positive predictive value (PPV), and escalation-level recall.\u003c/p\u003e\n\u003cdiv\u003e\n \u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"576\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eStream\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eN Escalations\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eN Alerts\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePPV (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eN Caught\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eRecall (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAlerts per Caught\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003etransactions\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e85\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e4007\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e1.0% [0.7%, 1.4%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e23.5% [15.0%, 34.0%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e200.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003esessions\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e57\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e2310\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e0.8% [0.5%, 1.2%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e21.1% [11.4%, 33.9%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e192.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003epayments\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e165\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e2310\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e4.9% [4.1%, 5.9%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e35.2% [27.9%, 43.0%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e39.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n 
\u003cp\u003ebets\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e2065\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e3.2% [2.5%, 4.0%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e43.4% [29.8%, 57.7%]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 14.2857%;\"\u003e\n \u003cp\u003e89.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eCalibration and uncertainty\u003c/h2\u003e \u003cp\u003eCalibration varied by stream. Reliability diagrams for two representative streams are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA, with per-stream reliability diagrams provided in Figure S2 and bin-level summaries in Table S7. Temperature scaling reduced expected calibration error (ECE) in every stream (for example, from 0.097 to 0.029 in transactions), but calibrated ECE remained between 0.191 and 0.277 in sessions, payments, and bets (Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e7\u003c/span\u003e; Appendix F), motivating periodic recalibration and drift monitoring for any operational deployment. 
Customer-level bootstrap confidence intervals for balanced accuracy on the combined labelled evaluation set (test+holdout) are shown in Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e7\u003c/span\u003e and Table S8 (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eCalibration and uncertainty. Customer-level bootstrap mean and 95% CI for balanced accuracy on the combined labelled evaluation set (test+holdout), and expected calibration error (ECE) on the same set.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN Combined\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eECE (raw)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c4\"\u003e \u003cp\u003eECE (calibrated)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eTemperature\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eBA Mean\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eBA 95% CI\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e5891\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.097\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.029\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.600\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.395\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e[0.368, 0.423]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e8449\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.287\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.191\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.400\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.497\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e[0.459, 0.539]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e3311\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.441\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.277\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.560\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e[0.464, 0.625]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1066\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.294\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.261\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.599\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e[0.510, 0.650]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eBaseline ladder and ablation comparisons\u003c/h2\u003e \u003cp\u003eCriterion validity relative to comparator baselines was heterogeneous across streams (Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e8\u003c/span\u003e; Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" 
class=\"InternalRef\"\u003e3\u003c/span\u003eB; Table S9). Compared with conventional non-temporal baseline models, the full pipeline achieved the highest balanced accuracy in bets (0.601), payments (0.624), and sessions (0.515), and the highest macro-F1 in bets (0.543) (Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e8\u003c/span\u003e). In transactions, logistic regression achieved slightly higher balanced accuracy (0.399 versus 0.380), and in sessions, payments, and transactions at least one non-temporal baseline achieved higher macro-F1. When compared with an incumbent RG proxy signal alone, agreement was stream-dependent: an \u0026ldquo;RG proxy only\u0026rdquo; comparator achieved comparable or higher balanced accuracy in transactions and payments, whereas the regime-based definition improved agreement in sessions and remained competitive in bets (Table S9). This heterogeneity supports treating the proposed definition as a monitoring construct whose utility is ultimately judged by capacity-constrained operating points rather than by single-metric superiority.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eConventional non-temporal baselines on labelled test windows. Logistic regression, random forest, and LightGBM (\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e) were trained as non-temporal classifiers (without GHMM temporal modelling) for comparison. 
Additional baselines, ablations, and the RG-proxy-only comparator are reported in the Supplementary Appendix (Table S9).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGHMM Pipeline BA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGHMM Pipeline F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLogReg BA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eLogReg F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eRF BA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eRF F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eLightGBM BA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e 
\u003cp\u003eLightGBM F1\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.380\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.318\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.399\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.325\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.373\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.297\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.364\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.350\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.515\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.289\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.493\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.265\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.441\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.428\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.399\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e 
\u003cp\u003e0.393\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.624\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.198\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.527\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.258\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.429\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.303\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.351\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.340\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.601\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.543\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.543\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.443\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.496\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.497\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.404\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.417\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eTemporal external validation\u003c/h2\u003e \u003cp\u003eIn a frozen-pipeline temporal evaluation, criterion-validity agreement and calibration were broadly stable, with holdout balanced accuracy differing from the combined labelled evaluation set by \u0026minus;\u0026thinsp;0.02 to +\u0026thinsp;0.02 across streams (Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e9\u003c/span\u003e; Figure S7). For clarity, Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e9\u003c/span\u003e reports metrics on the combined labelled evaluation set (test+holdout, used for bootstrap uncertainty in Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e7\u003c/span\u003e) alongside temporal holdout metrics; Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e reports test-only performance.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab8\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 9\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFrozen-pipeline temporal portability evaluation. Tr BA is balanced accuracy on the training split. The combined (Comb; test+holdout) and holdout (Hold) columns report criterion-validity agreement (BA, F1) and calibration (ECE) for a pipeline trained on the training period and evaluated on the temporal holdout period without refitting. 
ΔBA is holdout BA minus combined BA.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"10\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStream\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTr\u003c/p\u003e \u003cp\u003eBA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eComb BA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eComb F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eComb\u003c/p\u003e \u003cp\u003eECE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eHold\u003c/p\u003e \u003cp\u003eBA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eHold F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e 
\u003cp\u003eHold\u003c/p\u003e \u003cp\u003eECE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eΔBA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c10\"\u003e \u003cp\u003eHold def.\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etransactions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.446\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.394\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.331\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.097\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.405\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.341\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.089\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.011\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003eholdout split\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esessions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.441\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.498\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.299\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.287\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e \u003cp\u003e0.489\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.314\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.259\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u0026minus;0.009\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003eholdout split\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003epayments\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.538\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.567\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.219\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.441\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.550\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.231\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.473\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u0026minus;0.017\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003eholdout split\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ebets\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.650\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.610\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e0.516\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.294\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.634\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.437\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.404\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.024\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003eholdout split\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study shows that a practically usable operational proxy definition of gambling-related risk can be derived from routine behavioral telemetry using a leakage-controlled pipeline that preserves heavy-tailed dynamics and models risk as a time-varying regime. We provide three complementary lines of validity evidence: criterion validity (agreement with selectively observed manual assessments), temporal portability under a frozen pipeline, and operational utility under explicit capacity constraints. Criterion-validity agreement on held-out labelled windows was moderate (balanced accuracy 0.38\u0026ndash;0.62) and remained similar in a frozen-pipeline temporal holdout (0.40\u0026ndash;0.63), suggesting that results were not driven by period-specific fitting. Inferred state sequences were strongly persistent, with high self-transition probabilities and long implied dwell times, supporting the interpretation of risk as a slowly evolving process rather than an i.i.d. classification problem. The operational proxy definition derived here is not intended as a clinical diagnosis. 
Instead, it reflects behavioral patterns consistent with analyst-rated concern and is designed to support triage and workload planning.\u003c/p\u003e \u003cp\u003eA central finding is that headline discrimination metrics can be misleading under severe class imbalance and operational constraints (\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e). Summarizing probabilistic outputs by \u0026ldquo;argmax\u0026rdquo; assignment forces each window into a single class even when uncertainty is high, and it does not reflect how monitoring systems are used. In practice, manual review and intervention capacity is finite, so the operative question is how many cases can be reviewed per unit time and what yield and lead time can be achieved under that budget (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). When we evaluated explicit capacity-constrained policies, such as selecting the K highest-risk customers per week based on P(high), a top-10-per-week queue detected 38.6%\u0026ndash;62.3% of escalation events across streams, with median lead times of 42\u0026ndash;290 days (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). At a 30-day horizon, alert-level positive predictive value was low (0.8%\u0026ndash;4.9%) and escalation-level recall ranged from 21% to 43%, reflecting both low base rates and the constraints imposed by realistic review budgets (Table\u0026nbsp;\u003cspan refid=\"Tab9\" class=\"InternalRef\"\u003e6\u003c/span\u003e). 
These results emphasize that operational usefulness is determined by workload, yield, and lead time, not by accuracy alone.\u003c/p\u003e \u003cp\u003ePlaced in the context of account-based gambling research, our work aligns with evidence that behavioral telemetry can support identification of higher-risk play (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e, \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e), while addressing limitations repeatedly noted in synthesis work: heterogeneity in outcomes and reporting practices, potential leakage and feedback loops, and limited evaluation of temporal stability and deployment-relevant operating points (\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). Rather than treating risk as a static classification target, we model it as a latent state process, enforce strict causal ordering at the window level, and translate probabilities into explicit decision policies (thresholds, top-K queues, persistence gates) that can be audited and compared.\u003c/p\u003e \u003cp\u003eComparator analyses show that incumbent responsible-gambling (RG) proxy signals can already explain a substantial fraction of the observed assessment process in some streams (Table S9). 
In transactions and payments, an RG-proxy-only comparator achieved criterion-validity agreement comparable to or higher than the full pipeline, consistent with the possibility that operational labelling and existing RG systems are coupled through triage and feedback-loop effects (\u003cspan additionalcitationids=\"CR14\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). We therefore interpret the proposed approach as a definition framework that makes modelling choices and operating points explicit (leakage auditing, tail-preserving representations, regime dynamics, queue policies), rather than as a claim of universal single-metric dominance over incumbent proxies.\u003c/p\u003e \u003cp\u003eThe heavy-tail representation layer is a key methodological and interpretive contribution. Gambling activity is dominated by rare extremes rather than average behavior, and na\u0026iuml;ve clipping or long-horizon aggregation can erase precisely the signals that matter for prevention-oriented monitoring (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e). By explicitly representing extreme behavior via exceedance indicators (frequency) and exceedance magnitudes (intensity), summarized over short horizons, the framework distinguishes persistent clustering of extreme events from isolated spikes. In descriptive diagnostics, exceedance-derived features generally exhibited monotone gradients with analyst-assessed severity across streams, supporting face validity and anchoring regime outputs in observable behavioral signatures. This representation provides a behavioral narrative consistent with a dynamic risk process: escalation is often reflected in increasing persistence and clustering of extremes rather than a smooth shift in mean behavior. 
Some univariate associations drifted in the temporal holdout (Table S10), reinforcing the need for frozen-pipeline temporal evaluation and for representations that do not rely on any single handcrafted feature remaining stable over time.\u003c/p\u003e \u003cp\u003eAcross stages, the design goal was structural alignment: treat gambling telemetry as a multi-scale, heavy-tailed process and build an analysis pipeline that respects that geometry (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). Exceedance features capture tail events at short horizons; the hierarchical CVAE compresses correlated dynamics across time scales without washing out rare extremes; teacher\u0026ndash;student training uses abundant proxy supervision to shape the latent space and sparse gold labels to refine decision boundaries with minimal drift; BALI then corrects supervision under selective labelling; and the GHMM imposes an explicit temporal state model whose persistence can be audited and translated into queue-based review policies (\u003cspan additionalcitationids=\"CR17\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e). Taken together, these choices form a coherent modelling account rather than a grab bag of methods: risk is treated as a slowly varying latent regime manifested through clustered extremes.\u003c/p\u003e \u003cp\u003eThe representation learning stage was designed to summarize 30-day windows into stable low-dimensional embeddings while preventing leakage from sparse manual labels. 
A hierarchical conditional variational autoencoder with multiple temporal scales (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e) supports feature compression in a setting where engineered inputs are high-dimensional and strongly correlated, and the teacher\u0026ndash;student regime decouples large-scale proxy-supervised representation learning from manual-label adaptation (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e). While such embeddings are not inherently interpretable, we used embedding-geometry diagnostics to assess whether the latent space preserved an ordinal structure aligned with manual risk categories and to characterize differences in signal across streams and temporal scales. These diagnostics do not establish clinical validity, but they provide an auditable link between modelling intent (bursts, rhythms, drift) and how information is organized in the representation, while collapse-safeguard diagnostics help verify that latents are meaningfully used (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eA major practical challenge in this domain is selective labelling. Manual analyst assessments are operationally triaged and plausibly MNAR: windows are labelled preferentially when behavioral signals and system context (e.g., backlog) make review more likely (\u003cspan additionalcitationids=\"CR14 CR15 CR16 CR17\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). 
To mitigate the resulting selection bias, we introduced Backlog-Aware Label Inference (BALI), which models labelling propensity using embeddings and backlog covariates, applies inverse-probability weighting (\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e), and uses a conservative Bayes adjustment to pseudo-label a prespecified subset of high-confidence unlabelled windows for training. The approach was inspired by backlog and delay effects studied in actuarial claims-processing settings (\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e) but is adapted here to the selective-label setting of capacity-constrained human review workflows. Critically, pseudo-labelled windows were confined to the training split; all reported validation, test, and temporal holdout results relied exclusively on original manual assessments, preserving a strict separation between training augmentation and evaluation.\u003c/p\u003e \u003cp\u003eHidden Markov models provide a standard statistical framework for latent regime inference in time series under observation noise (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e). Dynamic regime modelling with a Gaussian hidden Markov model provides a natural bridge between machine learning outputs and systems-level interpretation (\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e). Regimes yield a compact state description and transition structure that can be logged, audited, and analyzed over time (\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e). 
In our setting, the GHMM is not a smoothing heuristic applied to a risk score; it encodes the modelling assumption that risk evolves as a persistent latent regime whose switching dynamics should be estimated explicitly and can be audited (\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e). Because transition estimation was regularized to favor persistence, persistence statistics should be interpreted as model-implied summaries under an explicit persistence prior. Operationally, we use regime persistence through \u0026ldquo;persistence gates\u0026rdquo;, which give explicit control over the trade-off between alert stability on the one hand and detection and lead time on the other (Appendix F; Tables S5\u0026ndash;S6). Such gating is particularly important in heavy-tailed behavioral domains, where isolated extreme days may be common but not necessarily indicative of sustained risk (\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e). Regime posteriors also support calibrated probabilities that can be mapped to review budgets and policy thresholds, making governance choices explicit rather than implicit (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAlthough online gambling is the motivating application, the methodological problem structure is broader: heavy-tailed digital traces, temporally evolving latent states, selectively observed human labels, and capacity-constrained decision workflows occur across many techno-socio-economic systems, including fraud and abuse monitoring, insurance claims triage, credit and financial risk, and platform integrity (\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e). The framework\u0026rsquo;s emphasis on leakage control, MNAR-aware supervision, dynamic state modelling, and capacity-aware evaluation is intended to be transferable to these settings.\u003c/p\u003e \u003cp\u003eSeveral limitations should be considered. 
First, the study uses data from a single operator; external validation across jurisdictions, products, and regulatory contexts is needed to assess generalizability. Second, manual analyst assessments are an operational reference standard rather than clinical diagnoses; they may be sparse, noisy, and influenced by existing RG systems, complicating interpretation of \u0026ldquo;ground truth\u0026rdquo; and introducing feedback-loop risks (\u003cspan additionalcitationids=\"CR14\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). Third, early-warning analyses are offline simulations anchored to the timing of analyst assessments, which may lag underlying harm; results therefore quantify potential operational lead time rather than causal impact on harms. They should be read as system-level backtests of monitoring policies under fixed data and label processes, not as estimates of intervention effectiveness. Fourth, calibration varied by stream; while temperature scaling reduced expected calibration error, any operational use would require ongoing recalibration and drift monitoring, particularly in label-scarce settings.\u003c/p\u003e \u003cp\u003eFuture work should prioritize multi-operator external validation; prospective evaluation of intervention policies derived from regime dynamics; linkage of regime-based trajectories to external outcomes (e.g., treatment contact, linked surveys, or other harm proxies); and formal audits for fairness, drift, and unintended feedback effects. 
Methodological extensions include multi-stream fusion and joint modelling of multiple harm endpoints while preserving the leakage controls and capacity-aware evaluation used here.\u003c/p\u003e \u003cp\u003eIn summary, heavy-tail-aware behavioral representations combined with dynamic regime modelling can support a transparent operational proxy definition of gambling-related risk and enable explicit evaluation of early warning and operating points under realistic review-capacity constraints.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eAUPRC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eArea under the precision\u0026ndash;recall curve\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eBA\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eBalanced accuracy\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eBALI\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eBacklog-aware label inference\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eCI\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eConfidence interval\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eCVAE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eConditional variational autoencoder\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eDSM-5\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eDiagnostic and Statistical Manual of Mental Disorders, 5th edition\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv 
class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eECE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eExpected calibration error\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eF1-score\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eHarmonic mean of precision and recall\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eGHMM\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eGaussian hidden Markov model\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eKL\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eKullback\u0026ndash;Leibler\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eLSTM\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eLong short-term memory\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eMNAR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMissing not at random\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eNNR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eNumber needed to review\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003ePCA\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003ePrincipal component analysis\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003ePGSI\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eProblem Gambling Severity Index\u003c/p\u003e \u003c/div\u003e 
\u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003ePOT\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003ePeaks over threshold\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003ePPV\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003ePositive predictive value\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003ePR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eParticipation ratio\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRF\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRandom forest\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRG\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eResponsible gambling\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eSNPS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eStandardized normalized partial sums\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eTCN\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eTemporal convolutional network\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eVAE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eVariational autoencoder\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003eAvailability of data and materials\u003c/p\u003e\n\u003cp\u003eRaw behavioral event data underpinning this study are subject to contractual and privacy 
restrictions with the participating operator and cannot be shared publicly. Access to a de-identified analysis extract may be considered for qualified researchers on reasonable request, subject to operator approval and appropriate data protection and confidentiality agreements. The code used to generate features, enforce leakage audits, train models, and (given access to the restricted analysis extract) reproduce all tables and figures is available at the project repository (52). An archived snapshot corresponding to this manuscript is available on Zenodo (53) (release tag: v0.1.0) under the GNU General Public License v3.0 (GPL-3.0-or-later). The public repository includes configuration files, run-order scripts, and a synthetic demo dataset sufficient to run a smoke-test of the downstream pipeline (BALI\u0026rarr;GHMM) without exposing personal data.\u003c/p\u003e\n\u003cp\u003eAvailability and requirements (software)\u003c/p\u003e\n\u003cp\u003eProject name: Definition Study (EPJ Data Science) \u0026mdash; Reproducibility Repository (definition-study-epjds)\u003c/p\u003e\n\u003cp\u003eProject home page: https://github.com/SamAndersson-C/definition-study-epjds\u003c/p\u003e\n\u003cp\u003eArchived version: Zenodo release DOI: https://doi.org/10.5281/zenodo.18653580 (release tag: v0.1.1)\u003c/p\u003e\n\u003cp\u003eOperating system(s): Platform independent\u003c/p\u003e\n\u003cp\u003eProgramming language: Python (\u0026ge;3.9)\u003c/p\u003e\n\u003cp\u003eOther requirements: See requirements-demo.txt (synthetic demo) and requirements.in (reference internal environment). A standard Python virtual environment is recommended.\u003c/p\u003e\n\u003cp\u003eLicense: GNU General Public License v3.0 (GPL-3.0-or-later)\u003c/p\u003e\n\u003cp\u003eAny restrictions to use by non-academics: None beyond the GPL-3.0-or-later license terms. 
Restricted operator data are not included in this repository.\u003c/p\u003e\n\u003cp\u003eEthics\u003c/p\u003e\n\u003cp\u003eThe study procedures were carried out in accordance with the Declaration of Helsinki. The study was reviewed and approved by the Swedish Ethical Review Authority (Dnr 2023-07288-02). Informed consent was waived by the review board to permit research on pre-existing registry data.\u003c/p\u003e\n\u003cp\u003eConsent for publication\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003eCompeting interests\u003c/p\u003e\n\u003cp\u003eThis study was conducted as part of an industry\u0026ndash;academia collaboration on Responsible Gambling financed by the LeoVegas Group, a licensed gambling operator in Sweden. The research was planned, conducted, and submitted under full academic freedom, as guaranteed by a written agreement. The funder had no role in the study design or conduct, data analysis or interpretation, or the decision to publish. SA\u0026rsquo;s doctoral position is financed by the LeoVegas Group; SA is employed by Karolinska Institutet and reports no other competing interests. PL and PC report past and ongoing industry\u0026ndash;academia collaborations with multiple gambling providers, including project-specific research funding, and report no personal financial ties to the gambling industry. OM has received funding from the Independent Research Council of Svenska Spel for clinical studies unrelated to the present study. All other authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003eFunding\u003c/p\u003e\n\u003cp\u003eThis study was funded by the LeoVegas Group, a licensed gambling operator in Sweden.\u003c/p\u003e\n\u003cp\u003eAuthors\u0026apos; contributions\u003c/p\u003e\n\u003cp\u003eSA conceptualized the study and methodology, developed and implemented the software pipeline, conducted the analyses, and wrote the manuscript. 
PL and PC contributed to the conceptualization by proposing the initial clinically motivated research question (to develop a proxy definition of problem gambling from behavioral data) and through scientific discussions. HW, TK, and KL contributed to study conceptualization and methodological development through detailed scientific and methodological discussions. PL secured funding. All authors (HW, TK, KL, PC, PL, and OM) reviewed the manuscript. PL and OM were responsible for project administration.\u003c/p\u003e\n\u003cp\u003eAcknowledgements\u003c/p\u003e\n\u003cp\u003eI would like to acknowledge the Department of Mathematical Statistics at Stockholm University and its seminar series for providing a stimulating environment and many valuable ideas that informed this work.\u003c/p\u003e\n\u003cp\u003eAI Use Declaration\u003c/p\u003e\n\u003cp\u003eDuring the preparation of this manuscript, the authors used the Claude Code API (Anthropic) and ChatGPT (OpenAI) to assist with language editing and improve clarity and readability. The authors reviewed and edited the outputs as needed and take full responsibility for the content of the publication.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAuthors\u0026apos; information\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003eAdditional files\u003c/p\u003e\n\u003cp\u003eAdditional file 1: Supplementary Appendix (DOCX). Extended methods, additional diagnostics, and additional tables/figures supporting the main text.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eLazer D, Pentland A, Adamic L, Aral S, Barab\u0026aacute;si AL, Brewer D et al (2009) Computational social science. Science 323(5915):721\u0026ndash;723\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKleinberg J, Lakkaraju H, Leskovec J, Ludwig J, Mullainathan S (2018) Human decisions and machine predictions. 
Q J Econ 133(1):237\u0026ndash;293\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang X, Pleimling M (2019) Online gambling of pure chance: wager distribution, risk attitude, and anomalous diffusion. Sci Rep 9:14712\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarab\u0026aacute;si AL (2005) The origin of bursts and heavy tails in human dynamics. Nature 435(7039):207\u0026ndash;211\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eClauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661\u0026ndash;703\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWardle H, Degenhardt L, Marionneau V et al (2024) The Lancet Public Health Commission on gambling. Lancet Public Health 9(11):e950\u0026ndash;e994. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/S2468-2667(24)00167-1\u003c/span\u003e\u003cspan address=\"10.1016/S2468-2667(24)00167-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAllami Y (2024) Strengthening oversight and integrity: The multi-faceted role of centralized player tracking systems in gambling. Addiction 119(7):1170\u0026ndash;1171\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAuer M, Griffiths MD (2023) The relationship between structural characteristics and gambling behaviour: an online gambling player tracking study. J Gambl Stud 39(1):265\u0026ndash;279\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFerris J, Wynne H (2001) The Canadian Problem Gambling Index: final report. Canadian Centre on Substance Abuse, Ottawa\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCurrie SR, Hodgins DC, Casey DM (2013) Validity of the Problem Gambling Severity Index interpretive categories. 
J Gambl Stud 29(2):311\u0026ndash;327\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMiller NV, Currie SR, Hodgins DC, Casey D (2013) Validation of the Problem Gambling Severity Index using confirmatory factor analysis and Rasch modelling. Int J Methods Psychiatr Res 22(3):245\u0026ndash;255\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBraverman J, LaPlante DA, Nelson SE, Shaffer HJ (2013) Using cross-game behavioral markers for early identification of high-risk internet gamblers. Psychol Addict Behav 27(3):868\u0026ndash;877\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhaharian K, Abarbanel B, Phung D, Puranik P, Kraus S, Feldman A et al (2023) Applications of data science for responsible gambling: a scoping review. Int Gambl Stud 23(2):289\u0026ndash;312\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMagnusson K, Nilsson A, Andersson G, Hellner C, Carlbring P (2019) Level of agreement between problem gamblers\u0026rsquo; and collaterals\u0026rsquo; reports: a Bayesian random-effects two-part model. J Gambl Stud 35(4):1127\u0026ndash;1145\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMurch WS, Kairouz S, French M (2024) Establishing the temporal stability of machine learning models that detect online gambling-related harms. Comput Hum Behav Rep 14:100427\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLakkaraju H, Kleinberg J, Leskovec J, Ludwig J, Mullainathan S (2017) The selective labels problem: evaluating algorithmic predictions in the presence of unobservables. In: Proc 23rd ACM SIGKDD Int Conf Knowl Discov Data Min\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDe-Arteaga M, Fogliato R, Chouldechova A, G\u0026rsquo;Sell M (2018) Learning under selective labels in the presence of expert consistency. arXiv:1807.00905\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRubin DB (1976) Inference and missing data. 
Biometrika 63(3):581\u0026ndash;592\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAhmad Z, Mahmoudi E, Hamedani GG, Kharazmi O (2020) New methods to define heavy-tailed distributions with applications to insurance data. J Taibah Univ Sci 14(1):359\u0026ndash;382\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaufman S, Rosset S, Perlich C, Stitelman O (2012) Leakage in data mining: formulation, detection, and avoidance. ACM Trans Knowl Discov Data 6(4):15\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKingma DP, Welling M (2013) Auto-Encoding Variational Bayes. arXiv:1312.6114\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSohn K, Lee H, Yan X (2015) Learning structured output representation using deep conditional generative models. Adv Neural Inf Process Syst 28:3483\u0026ndash;3491\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv:1503.02531\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLindskog F, W\u0026uuml;thrich MV (2025) Eliciting claims development patterns and costs hidden in backlogs. Eur Actuar J 15(3):667\u0026ndash;705\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257\u0026ndash;286\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo C, Pleiss G, Sun Y, Weinberger KQ (2017) On calibration of modern neural networks. In: Proc 34th Int Conf Mach Learn (ICML). PMLR 70:1321\u0026ndash;1330\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins JM (2018) Double/debiased machine learning for treatment and structural parameters. 
Econometrics J 21(1):C1\u0026ndash;C68\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eColes S (2001) An introduction to statistical modeling of extreme values. Springer, London\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePage ES (1954) Continuous inspection schemes. Biometrika 41(1\u0026ndash;2):100\u0026ndash;115\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eColes SG, Heffernan J, Tawn JA (1999) Dependence measures for extreme value analyses. Extremes 2(4):339\u0026ndash;365\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735\u0026ndash;1780\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBreiman L (2001) Random forests. Mach Learn 45(1):5\u0026ndash;32\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. arXiv:1706.03762\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe J, Spokoyny D, Neubig G, Berg-Kirkpatrick T (2019) Lagging inference networks and posterior collapse in variational autoencoders. In: Proc Int Conf Learn Represent (ICLR)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBowman SR, Vilnis L, Vinyals O, Dai AM, Jozefowicz R, Bengio S (2016) Generating sentences from a continuous space. In: Proc 20th SIGNLL Conf Comput Nat Lang Learn (CoNLL). pp 10\u0026ndash;21\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, Welling M (2016) Improving variational inference with inverse autoregressive flow. 
arXiv:1606.04934\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMiyato T, Kataoka T, Koyama M, Yoshida Y (2018) Spectral normalization for generative adversarial networks. arXiv:1802.05957\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34(2):187\u0026ndash;220\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSeaman SR, White IR (2013) Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 22(3):278\u0026ndash;295\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu H, Lafferty J, Wasserman L (2009) The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J Mach Learn Res 10:2295\u0026ndash;2328\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFox EB, Sudderth EB, Jordan MI, Willsky AS (2011) A sticky HDP-HMM with application to speaker diarization. Ann Appl Stat 5(2A):1020\u0026ndash;1056\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEfron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman \u0026amp; Hall, London\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSaito T, Rehmsmeier M (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3):e0118432\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCatania M, Griffiths MD (2021) Applying the DSM-5 criteria for gambling disorder to online gambling account-based tracking data: an empirical study utilizing cluster analysis. J Gambl Stud 38(4):1289\u0026ndash;1306\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMurch WS, Kairouz S, Dauphinais S, Picard E, Costes J, French M (2023) Using machine learning to retrospectively predict self-reported gambling problems in Quebec. 
Addiction 118(8):1569\u0026ndash;1578\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCapp\u0026eacute; O, Moulines E, Ryd\u0026eacute;n T (2005) Inference in hidden Markov models. Springer, New York\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSculley D, Holt G, Golovin D et al (2015) Hidden technical debt in machine learning systems. Adv Neural Inf Process Syst 28:2503\u0026ndash;2511\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHill BM (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3(5):1163\u0026ndash;1174\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBolton RJ, Hand DJ (2002) Statistical fraud detection: a review. Stat Sci 17(3):235\u0026ndash;255\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKe G, Meng Q, Finley T et al (2017) LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 30:3146\u0026ndash;3154\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAndersson S (2026) Definition Study (EPJ Data Science) \u0026mdash; Reproducibility Repository (definition-study-epjds). GitHub repository. Available at: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/SamAndersson-C/definition-study-epjds\u003c/span\u003e\u003cspan address=\"https://github.com/SamAndersson-C/definition-study-epjds\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (Accessed 16 Feb 2026)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAndersson S (2026) Definition Study (EPJ Data Science) \u0026mdash; Reproducibility Repository (definition-study-epjds) (v0.1.1). Zenodo. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/zenodo.18653580\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.18653580\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (Accessed 16 Feb 2026)\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"epj-data-science","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"epds","sideBox":"Learn more about [EPJ Data Science](https://epjdatascience.springeropen.com/)","snPcode":"13688","submissionUrl":"https://submission.springernature.com/new-submission/13688/3","title":"EPJ Data Science","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Open","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"problem gambling, operational risk definition, heavy-tailed behavior, representation learning, variational autoencoder, hidden Markov model, early warning, capacity-constrained triage, missing-not-at-random labels","lastPublishedDoi":"10.21203/rs.3.rs-8889984/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8889984/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eBackground: Problem gambling causes harm, but operational identification often relies on heuristic thresholds or sparse manual reviews. 
Routine online gambling logs are heavy-tailed and temporally structured, complicating risk definition and early detection.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eMethods: We analysed de-identified records from an online gambling operator across four streams (transactions, bets, sessions, payments). Time series were summarised into leakage-audited 30-day windows with heavy-tail-aware exceedance frequency and magnitude features. Window embeddings were learned using a hierarchical conditional variational autoencoder: a teacher trained on responsible-gambling proxy signals, then a student fine-tuned on sparse manual analyst assessments on the training split only. To address missing-not-at-random assessments, backlog-aware label inference conservatively augmented training data. Dynamic regimes were inferred from embeddings using a regularised Gaussian hidden Markov model, yielding a three-class operational proxy definition. Agreement with analyst assessments and early-warning utility under explicit capacity constraints were evaluated on held-out labels.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eResults: Balanced accuracy on labelled test windows ranged from 0.38 (transactions) to 0.62 (payments), with best macro-averaged F1-score in bets (0.54). 
Under a capacity-constrained top-10-per-week queue, escalation detection ranged from 0.39 (sessions) to 0.62 (bets), with median lead times of 42–290 days.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eConclusions: Heavy-tail-aware representations combined with dynamic regime modelling can derive an auditable operational proxy definition of gambling-related risk from routine data and support realistic, capacity-constrained monitoring.\u003c/p\u003e","manuscriptTitle":"Heavy-tail-aware representation learning and dynamic Bayesian state modelling to derive an operational proxy definition of problem gambling risk from routine online gambling data","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-12 09:33:50","doi":"10.21203/rs.3.rs-8889984/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"editorInvitedReview","content":"","date":"2026-04-03T11:55:03+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"274324533740338675834655343439692218601","date":"2026-04-01T10:03:13+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-03-16T09:28:33+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-16T09:29:05+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-16T09:27:44+00:00","index":"","fulltext":""},{"type":"submitted","content":"EPJ Data Science","date":"2026-02-16T05:38:54+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"epj-data-science","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"epds","sideBox":"Learn more about [EPJ Data Science](https://epjdatascience.springeropen.com/)","snPcode":"13688","submissionUrl":"https://submission.springernature.com/new-submission/13688/3","title":"EPJ Data 
Science","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Open","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"7f51532d-520e-4835-8f18-5008da141738","owner":[],"postedDate":"March 12th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-03-16T09:40:56+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-12 09:33:50","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8889984","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8889984","identity":"rs-8889984","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
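
The Methods paragraph of the abstract mentions "heavy-tail-aware exceedance frequency and magnitude features" computed over 30-day windows. The sketch below illustrates one plausible reading of that idea and is not taken from the project repository. The function name `exceedance_features`, the output keys, and the choice of a 95% quantile threshold estimated on a reference sample are all assumptions made for illustration.

```python
import numpy as np


def exceedance_features(window, threshold):
    """Summarise one window of amounts relative to a fixed threshold.

    Hypothetical helper: returns the share of observations above the
    threshold (frequency) and the mean and maximum excess over it (magnitude).
    """
    x = np.asarray(window, dtype=float)
    if x.size == 0:
        # An empty window carries no exceedances.
        return {"exc_freq": 0.0, "exc_mean_excess": 0.0, "exc_max_excess": 0.0}
    excess = x[x > threshold] - threshold
    return {
        "exc_freq": excess.size / x.size,
        "exc_mean_excess": float(excess.mean()) if excess.size else 0.0,
        "exc_max_excess": float(excess.max()) if excess.size else 0.0,
    }


# Assumption: the threshold is estimated once on a reference (training) sample
# and then reused unchanged, so later windows cannot leak into it.
rng = np.random.default_rng(0)
reference_sample = rng.pareto(2.0, size=5000)  # synthetic heavy-tailed data
u = float(np.quantile(reference_sample, 0.95))

# Small hand-checkable example with an explicit threshold of 5.0.
demo = exceedance_features([1.0, 2.0, 10.0, 50.0], threshold=5.0)
empty = exceedance_features([], threshold=5.0)
```

In the hand-checkable example, two of the four values exceed 5.0, with excesses of 5 and 45. Separating how often the threshold is crossed from how far it is crossed keeps a single extreme value from dominating a window-level average.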

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-20T11:00:21.680559+00:00

License: CC-BY-4.0