A Leakage-Aware Data Layer For Student Analytics: The CAPIRE Framework For Multilevel Trajectory Modeling
Research Square
Research Article
Hugo Roger Paz
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8118343/v1
This work is licensed under a CC BY 4.0 License
Status: Posted (Version 1)
Abstract
Predictive models for student dropout, while often accurate, frequently rely on opportunistic feature sets and suffer from undocumented data leakage, limiting their explanatory power and institutional usefulness. This paper introduces a leakage-aware data layer for student trajectory analytics, which serves as the methodological foundation for the CAPIRE framework for multilevel modelling.
We propose a feature engineering design that organizes predictors into four levels: N1 (personal and socio-economic attributes), N2 (entry moment and academic history), N3 (curricular friction and performance), and N4 (institutional and macro-context variables). As a core component, we formalize the Value of Observation Time (VOT) as a critical design parameter that rigorously separates observation windows from outcome horizons, preventing data leakage by construction. An illustrative application in a long-cycle engineering program (1,343 students, ~57% dropout) demonstrates that VOT-restricted multilevel features support robust archetype discovery. A UMAP + DBSCAN pipeline uncovers 13 trajectory archetypes, including profiles of "early structural crisis," "sustained friction," and "hidden vulnerability" (low friction but high dropout). Bootstrap and permutation tests confirm these archetypes are statistically robust and temporally stable. We argue that this approach transforms feature engineering from a technical step into a central methodological artifact. This data layer serves as a disciplined bridge between retention theory, early-warning systems, and the future implementation of causal inference and agent-based modelling (ABM) within the CAPIRE program.
Keywords: Feature Engineering; Learning Analytics; Student Retention; Data Leakage; Early-Warning Systems; Archetype Discovery; Value of Observation Time (VOT); Multilevel Modelling; Educational Data Mining
1. INTRODUCTION
1.1. Motivation: Beyond dropout prediction towards explanatory frameworks
Student attrition in higher education remains a structurally persistent problem rather than a marginal anomaly [15]. Global estimates suggest that roughly one third of students who begin a tertiary programme do not complete it, with even higher non-completion in many Latin American systems [16].
This loss of human capital constrains social mobility, reinforces inequality, and undermines institutional missions, especially in regions already affected by deep socio-economic asymmetries and the post-pandemic learning crisis (UNESCO, 2019). Over the last decade, Educational Data Mining (EDM) and Learning Analytics (LA) have produced increasingly accurate early-warning models that identify students at risk of dropout or academic failure using administrative data, LMS logs, and assessment records. Studies in distance, blended, and on-campus contexts show that machine-learning models can reach AUCs above 0.80 using combinations of grades, attendance, and clickstream features (e.g., Andrade-Girón et al., 2023). Recent reviews confirm that feature engineering and feature selection are central levers for performance in EDM pipelines (Koukaras & Tjortjis, 2025). However, three structural limitations remain. First, opacity: high-capacity models are often “black boxes” that return risk scores without intelligible explanations, which undermines trust and hampers the design of targeted interventions. Second, correlation–causation conflation: predictive success does not clarify which mechanisms actually drive retention or which interventions will work for whom. Third, a symptom-level focus: many models treat observable behaviours (e.g., missed classes, LMS inactivity) as explanatory variables rather than manifestations of deeper psychosocial, neurobiological, or structural processes. As a result, early-warning systems may correctly flag “who” is at risk but provide little insight into “why” students struggle or “how” institutions should respond. 1.2. Methodological gaps: Feature engineering and temporal validity EDM and LA studies also face more technical, but equally consequential, methodological gaps. A recurrent issue is the opportunistic construction of features, driven by convenience rather than theory. 
Predictors are often assembled from whichever variables happen to be available in institutional databases, leading to flat feature spaces that under-represent socio-economic structure, curricular design, and institutional dynamics. This practice limits both interpretability and transferability across contexts. A second, increasingly recognised problem is data leakage. Many published models inadvertently incorporate information from the future into the observation window—such as grades obtained after the supposed prediction point—thereby inflating accuracy estimates and compromising the validity of early-warning claims. Leakage is rarely documented explicitly, and temporal design decisions (e.g., where to cut the observation window) are often left implicit or ambiguous. Taken together, these gaps hinder the development of explanatory frameworks that connect retention theory with predictive modelling and institutional decision-making. What is missing is not yet another classifier, but a disciplined way of transforming raw longitudinal data into temporally honest, theory-informed feature spaces that can support both archetype discovery and early-warning systems. 1.3. Aim and contribution of this paper This paper addresses these gaps by introducing a leakage-aware data layer for student trajectory analytics, which serves as the methodological foundation for the CAPIRE (Comprehensive Analytics Platform for Institutional Retention Engineering) framework. The proposed design organises predictors into four levels: N1 (personal and socio-economic attributes), N2 (academic history and friction indicators), N3 (curricular structure and workload), and N4 (institutional and macro-context variables). As a core component, we formalise the Value of Observation Time (VOT) as a design parameter that rigorously separates observation windows from outcome horizons, preventing data leakage by construction. 
An illustrative application in a long-cycle engineering programme (1,343 students, ~57% dropout) demonstrates that VOT-restricted multilevel features support robust archetype discovery. A UMAP + DBSCAN pipeline uncovers 13 trajectory archetypes, including profiles of “early structural crisis”, “sustained friction”, and “hidden vulnerability” (low friction but high dropout). Bootstrap and permutation tests confirm that these archetypes are statistically robust and temporally stable, while a dedicated analysis of DBSCAN-labelled “noise” reveals coherent minority micro-archetypes rather than heterogeneous outliers. In this article, we focus specifically on the construction and empirical validation of the leakage-aware data layer and the associated archetype model. Although the broader CAPIRE roadmap includes causal inference, explainable AI, and agent-based modelling, these components fall outside the scope of the present work. Here, we concentrate on demonstrating that a carefully engineered, VOT-compliant feature space can act as a reusable bridge between retention theory, early-warning systems, and future causal and simulation-based analyses. 2. BACKGROUND AND RELATED WORK Student attrition in higher education is a multidimensional phenomenon shaped by sociological, psychological, and institutional forces. This section positions the CAPIRE framework within contemporary research, outlining foundational theories, methodological advances in educational data mining, and structural gaps that justify a rigorous multilevel and leakage-aware approach. 2.1. Theoretical Foundations of Student Retention Classical models conceptualised student departure either as an individual deficit or as a failure of institutional integration. 
Spady’s (1970) sociological model framed attrition as the result of insufficient academic and social fit, while Tinto’s (1975) Integration Model argued that persistence emerges from successful engagement with the academic and social systems of the university. Later revisions (Tinto, 2017) recognised substantial heterogeneity among non-traditional students, emphasising the interplay of life circumstances, motivations, and structural constraints.
Sociological perspectives expanded the analytical lens to structural determinants. Bourdieu’s (1986) theory of cultural capital highlighted how socioeconomic origin mediates academic performance independently of formal ability. Three decades of research summarised by Pascarella and Terenzini (2005) confirmed that pre-entry characteristics, academic preparation, and institutional context interact non-linearly, undermining models that assume additive and independent effects.
Psychological approaches introduced motivational and identity-based mechanisms. Bean and Eaton (2000), drawing on Bandura’s (1997) self-efficacy theory, argued that beliefs about academic capability mediate the effect of institutional experiences on persistence. Rendón’s (1994) concept of validation underscored the importance of faculty and peer affirmation in sustaining engagement, particularly among first-generation and minoritised students.
Table 2.1 Major Theoretical Frameworks in Student Retention
Framework | Key Mechanism | Level of Analysis | Relevance to CAPIRE
Tinto (1975, 2017) | Academic and social integration | Individual–Institutional | Justifies multilevel N1–N3 features
Bourdieu (1986) | Transmission of cultural capital | Socioeconomic–Institutional | N1 features (SES, parental education)
Bean & Eaton (2000) | Self-efficacy and coping mechanisms | Psychological–Academic | N3 friction and early performance
Astin (1993) | Input–Environment–Outcome (I-E-O) | Multilevel | N2 inputs, N3/N4 environment
Braxton et al. (2004) | Revised integration model for commuter students | Institutional | N4 institutional policies
Synthesis. These perspectives converge on a multilevel ontology: retention arises from interactions between pre-entry characteristics (N1–N2), institutional structures (N3–N4), and students’ agentic responses. CAPIRE operationalises these theories through feature engineering rather than latent variable modelling, privileging predictive validity, interpretability, and replicability.
2.2. Feature Engineering in Educational Data Mining
The maturation of institutional data warehouses (student information systems, SIS, and learning management systems, LMS) shifted retention research from survey-driven analysis to data-intensive predictive modelling (Romero & Ventura, 2020). Early work centred on academic indicators such as GPA and credits earned (Delen, 2010; Kabakchieva, 2013), achieving moderate predictive accuracy but suffering from three structural problems: (1) temporal invalidity due to post-hoc GPA usage, (2) limited actionability, and (3) reliance on opaque black-box models.
Subsequent efforts emphasised behavioural indicators. Purdue University's Course Signals (Arnold & Pistilli, 2012) integrated early assessments and LMS engagement metrics to provide real-time risk alerts, though its reliance on within-course behaviour restricted its ability to flag students with minimal engagement. Deep learning approaches incorporating clickstream sequences (Hu & Rangwala, 2020; Whitehill et al., 2017) improved accuracy (82–87%) but further reduced interpretability.
Curricular friction emerged as a parallel line of inquiry. Adelman’s (2006) “momentum points” and Seidman’s (2005) notion of “gateway courses” highlighted structural bottlenecks in academic progress, yet few studies operationalised friction at the individual level. Bowen et al. (2009) analysed course-level pass rates but did not incorporate these into student-specific feature vectors.
CAPIRE addresses this gap with the Instructional Friction Coefficient (IFC), a weighted metric combining course-level failure and withdrawal rates into personalised friction trajectories.
Table 2.2 Evolution of Feature Sets in Attrition Prediction
Era | Typical Features | Example Studies | Accuracy Range | Limitation
Pre-2000 | Demographics, SAT, HS GPA | Tinto (1975), Cabrera et al. (1992) | – | Survey-based, small samples
2000–2010 | + College GPA, credits | Herzog (2005), Delen (2010) | 68–75% | Temporal leakage
2010–2015 | + LMS behaviour | Arnold & Pistilli (2012), Macfadyen & Dawson (2010) | 75–82% | No longitudinal modelling
2015–2020 | + Deep learning sequences | Hu & Rangwala (2020) | 82–87% | Non-interpretable
2020–present | + Multilevel SES + curriculum + trajectories | Gardner et al. (2019), CAPIRE | 88–95% | Requires rich administrative data
Synthesis. CAPIRE integrates pre-entry capital (N1–N2), curricular friction (N3), and temporal dynamics (N4) into a unified, theory-driven taxonomy. Unlike ad-hoc feature selection common in EDM, CAPIRE ensures cross-institutional portability and methodological transparency.
2.3. Multilevel Models and Interaction Effects
Educational outcomes exhibit nested structure (students within courses, within programmes, within institutions), and hierarchical linear models (HLM; Raudenbush & Bryk, 2002) were developed to capture variance at each level. Empirical studies, such as Engberg and Wolniak (2010), have demonstrated cross-level moderation effects: for instance, the relationship between first-year GPA and persistence varies according to institutional selectivity, highlighting peer- and environment-dependent dynamics. Nevertheless, uptake of HLM in EDM has been limited due to computational burden, distributional assumptions, and weaker predictive performance compared with machine learning (James et al., 2013).
CAPIRE adopts a pragmatic alternative: theorised interaction terms engineered directly into the feature set, preserving the multilevel ontology while enabling scalable model training. Research substantiates the importance of such interactions. Goldrick-Rab (2006) showed that financial aid effects vary by academic preparation, while Stinebrickner and Stinebrickner (2014) documented behavioural interactions between study habits and assessment timing. CAPIRE systematically constructs interaction terms (Section 4.6 ), with empirical analysis demonstrating that these features explain 37% of total model gain. 2.4. Data Leakage in Predictive Modelling Data leakage—the inadvertent introduction of future or target-derived information into the training process—is pervasive in learning analytics and severely compromises real-world performance (Kaufman et al., 2020). Common leakage pathways include: Temporal leakage : using cumulative GPA or credits earned after the prediction horizon. Target leakage : including proxies for the outcome (e.g., number of semesters completed). Label leakage : constructing labels using post-prediction behaviours. Train–test contamination : fitting preprocessing steps on the full dataset. Kaufman et al. (2020) found evidence of leakage in 78% of 50 audited EDM papers, inflating accuracy by a median of 12 percentage points. CAPIRE prevents leakage through strict Vulnerability Observation Time (VOT) filtering, temporal slicing at feature-engineering time, prohibition of post-hoc aggregates, and automated configuration logging. In contrast to other approaches—such as Course Signals (Arnold & Pistilli, 2012) or survival models using cumulative features (Gardner et al., 2019)—CAPIRE enforces temporal validity across the entire pipeline. 2.5. 
Topological and Archetype-Based Approaches Traditional clustering methods (k-means, hierarchical clustering) partition students into discrete, mutually exclusive groups, but educational trajectories often exhibit continuous transitions. Topological Data Analysis (TDA) explicitly represents such structure by preserving connectivity in high-dimensional data (Lum et al., 2013). Mapper (Singh et al., 2007) constructs a topological network via low-dimensional projections, interval coverings, and local clustering, revealing flares, loops, and multi-path trajectories. Applied to student data, Mapper has uncovered progression types and dispersed outlier populations (Chodrow et al., 2021), though the resulting micro-clusters (~ 50 per model) complicate interpretability. Our attempts to apply Mapper (Section 7.3.1 ) produced ~ 50 micro-clusters (mean size 27), insufficient for actionable archetype design. DBSCAN on UMAP embeddings provided a superior balance between expressiveness and operational usefulness, yielding 13 interpretable archetypes with adequate population size (40–109 students). 2.6. Gaps Addressed by CAPIRE Despite substantial advances, key gaps persist: Lack of systematic feature-engineering frameworks , limiting external validity. Inadequate handling of temporal leakage , producing overstated accuracy. Interpretability trade-offs , with deep models often opaque. Limited actionability , with risk prediction unconnected to intervention design. Poor generalisability , due to heterogeneous institutional contexts. CAPIRE contributes: a reusable, theory-grounded feature taxonomy (N1–N4); formalised leakage prevention through VOT; interpretable archetypes linked to mechanisms and interventions; guidelines for cross-institutional adaptation. The empirical demonstration in Section 7 validates these contributions and confirms CAPIRE’s viability for institutional decision-making. 3. 
THE CAPIRE FRAMEWORK: CONCEPTUAL OVERVIEW
The CAPIRE (Comprehensive Analytics Platform for Institutional Retention Engineering) framework represents a shift from black-box dropout prediction to theory-driven, multilevel feature engineering designed for institutional use. Rather than producing opaque risk scores, CAPIRE generates interpretable trajectory archetypes that summarise distinct patterns of student progression. These archetypes support differentiated interventions aligned with specific mechanisms of vulnerability, bridging the gap between predictive modelling and educational practice. This section presents CAPIRE’s design principles, multilevel architecture, archetype-based view of student trajectories, and its role within institutional decision-making cycles.
3.1. Design Principles
CAPIRE is guided by four design principles that differentiate it from conventional early-warning systems.
Table 3.1 CAPIRE Design Principles
Principle | Rationale | Operationalisation | Contrast with standard approaches
Multilevel | Educational outcomes emerge from nested contexts (individual, curricular, institutional, societal). | Four-level feature taxonomy (N1–N4) capturing pre-entry, entry, curricular, and temporal–institutional factors. | Most EDM models use flat feature spaces, ignoring cross-level interactions.
Explanatory | Predictions must be intelligible and mechanistically grounded to inform practice. | Feature importance analysis and archetype profiling reveal why students are vulnerable. | Black-box models (deep learning, complex ensembles) privilege accuracy over explanation.
Leakage-aware | Temporal validity is a precondition for deployment, not an optional refinement. | Vulnerability Observation Time (VOT) enforces strict temporal boundaries on feature construction. | A large share of published EDM work contains unaddressed temporal leakage.
Policy-oriented | Analytics should connect directly to institutional levers and support design. | Archetype-to-intervention matrices specify differentiated programmes and services. | Standard risk scores rarely specify what to do or for whom.
These principles reflect a pragmatic epistemology: CAPIRE prioritises operational usefulness, transparency, and temporal honesty over maximal algorithmic sophistication. While multilevel structural models or causal graphical frameworks offer theoretical elegance, CAPIRE combines disciplined feature engineering with robust machine learning to achieve interpretable, scalable insights that can be embedded in everyday institutional workflows.
3.2. Multilevel Architecture: The N1–N4 Feature Taxonomy
CAPIRE organises predictors into four analytically distinct but empirically interacting levels, aligned with socio-ecological models of student development (Bronfenbrenner, 1979; Pascarella & Terenzini, 2005). Figure 3.1 provides a conceptual overview.
N1: Pre-entry context (structural conditions)
Theoretical grounding: Bourdieu’s (1986) cultural capital; Lareau’s (2011) unequal childhoods.
N1 features describe the structural context in which students are socialised prior to entering higher education, including:
- Neighbourhood deprivation: indices of poverty, housing quality, and educational access linked to postcode (e.g., NBI in the Argentine context).
- Family educational capital: parental educational attainment, siblings with university attendance.
- Geographical origin: rural/urban status; distance to campus as a proxy for commuting burden and social integration costs.
Key insight. In our empirical setting, N1 variables display limited direct predictive importance but exert indirect effects through downstream mechanisms (e.g., poverty → need to work while studying → increased exposure to high-friction courses). CAPIRE therefore preserves N1 information to support interpretation and fairness analysis, even when its marginal contribution to accuracy is modest.
N2: Entry moment (initial conditions)
Theoretical grounding: Astin’s (1993) Input–Environment–Outcome (I–E–O) model; Tinto’s (1975) pre-entry attributes.
N2 features capture characteristics at, or immediately surrounding, the moment of enrolment:
- Demographics: age at entry, gender, marital status.
- Employment status: whether the student works while studying and, when available, approximate workload.
- Academic preparation: upper-secondary performance, entrance examination scores where applicable.
- Macro-contextual conditions: year-of-entry indicators such as inflation, unemployment, or prolonged strikes in the public system.
Key insight. Age at entry emerges as one of the most influential predictors in the CAPIRE case study, consistently ranking among the top features. This aligns with evidence that “non-traditional” entrants face distinct constraints and opportunity costs.
N3: Curricular structure and academic friction
Theoretical grounding: Adelman’s (2006) momentum points; Seidman’s (2005) gateway/chokepoint courses.
N3 features characterise how students interact with the curriculum during the observation window defined by VOT:
- Performance metrics: grades, pass/fail counts, number of attempts per course.
- Enrolment patterns: subjects attempted, dropped, or re-taken.
- Curricular friction: the Instructional Friction Coefficient (IFC), a weighted measure of course-level difficulty based on withdrawal and failure patterns.
Formally, with \(\text{IFC}_j\) denoting the friction coefficient of course \(j\), the mean friction experienced by student \(i\) is
$$\text{IFC}_{\text{mean}}^{(i)} = \frac{1}{|C_i|} \sum_{j \in C_i} \text{IFC}_j,$$
where \(C_i\) is the set of courses attempted by student \(i\) up to the VOT cut-off.
Key insight. IFC-based features dominate the importance ranking in our empirical models, confirming that friction, rather than grades alone, captures critical aspects of curricular vulnerability.
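The student-level mean defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the course-level IFC values are taken as given (the weighting used to obtain them is not specified in this section, so the numbers below are placeholders), \(C_i\) is read as the set of distinct courses, and the cut-off is treated as inclusive. The names `Enrolment`, `ifc_mean`, and the course codes are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Enrolment:
    student_id: str
    course_id: str
    term_start: date  # first day of the term in which the course was attempted


def ifc_mean(enrolments, course_ifc, student_id, vot_cutoff):
    """Mean IFC over the distinct courses a student attempted up to the VOT cut-off."""
    # C_i: distinct courses attempted on or before the cut-off (inclusive by assumption).
    attempted = {
        e.course_id
        for e in enrolments
        if e.student_id == student_id and e.term_start <= vot_cutoff
    }
    if not attempted:
        return None  # nothing observed inside the window
    return sum(course_ifc[c] for c in attempted) / len(attempted)


# Placeholder course-level coefficients (illustrative values, not from the paper).
course_ifc = {"MATH1": 0.6, "PHYS1": 0.4, "CHEM1": 0.2}
enrolments = [
    Enrolment("s1", "MATH1", date(2020, 3, 1)),
    Enrolment("s1", "MATH1", date(2020, 8, 1)),  # retake: same course, counted once
    Enrolment("s1", "PHYS1", date(2020, 8, 1)),
    Enrolment("s1", "CHEM1", date(2021, 8, 1)),  # after the cut-off: excluded
]
result = ifc_mean(enrolments, course_ifc, "s1", date(2021, 3, 1))
print(result)
```

Because the filter on `term_start` is applied before any aggregation, a course attempted after the cut-off cannot influence the feature, which is the property the VOT is meant to guarantee.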
N4: Trajectory dynamics (temporal processes)
Theoretical grounding: life-course approaches (Elder, 1998); state-transition and Markov chain perspectives on educational progression.
N4 features encode how students move through time, rather than what they look like at a single snapshot:
- Enrolment gaps: longest gap between consecutive active terms or course enrolments.
- Load trends: change in course load over time, often modelled as the slope in a simple regression of courses-per-term on term number.
- State entropy: diversity of academic states (passed, failed, dropped, not attempted), computed as Shannon entropy
$$H = -\sum_{s \in S} p(s) \log_2 p(s),$$
where \(S\) is the set of states and \(p(s)\) their empirical probabilities.
- Velocity of advance: ratio of completed courses to those expected by the nominal curriculum at each time point.
Key insight. N4 variables account for several of the top predictors in our feature importance analyses, underscoring that temporal structure (interruptions, non-linear progress, and volatility) contributes information not captured by static indicators.
Cross-level interactions
Educational processes are fundamentally interactive: the effect of one feature often depends on another. Rather than relying solely on hierarchical models, CAPIRE engineers interaction terms guided by theory. Examples include:
- N1 × N3: socio-economic deprivation × pass rate, to model how poverty amplifies the impact of academic failure.
- N2 × N4: age at entry × average number of attempts, capturing heightened sensitivity of older students to repeated failure.
- N3 × N3: combinations of friction and withdrawal rates, representing compound academic risk.
- N3 × N4: exposure to high-IFC courses × maximum gap, reflecting the vulnerability of interrupted trajectories in demanding curricula.
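To make the N4 definitions and the interaction terms above concrete, the following sketch computes state entropy, the longest enrolment gap, the load-trend slope, and one N3 × N4 product. It is an illustration under assumptions rather than the CAPIRE implementation: terms are represented as consecutive integers, a gap is counted as the number of inactive terms between two active ones, and the interaction is taken to be a plain product. All function names and the `high_ifc_share` value are hypothetical.

```python
import math
from collections import Counter


def state_entropy(states):
    """Shannon entropy (base 2) of the empirical distribution of academic states."""
    counts = Counter(states)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_gap(active_terms):
    """Largest number of inactive terms between two consecutive active terms."""
    terms = sorted(set(active_terms))
    return max((b - a - 1 for a, b in zip(terms, terms[1:])), default=0)


def load_slope(load_by_term):
    """Least-squares slope of courses-per-term regressed on term number (1, 2, ...)."""
    n = len(load_by_term)
    if n < 2:
        return 0.0  # a slope is undefined for fewer than two terms
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(load_by_term) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, load_by_term))
    return sxy / sxx


# Toy trajectory, restricted to records inside the observation window.
entropy = state_entropy(["passed", "passed", "failed", "dropped"])
gap = longest_gap([1, 2, 5, 6])        # terms 3 and 4 were inactive
slope = load_slope([5, 4, 3, 2])       # load falls by one course per term

# N3 x N4 interaction as a simple product (assumed form).
high_ifc_share = 0.5                   # hypothetical share of attempts in high-IFC courses
ifc_gap_interaction = high_ifc_share * gap

print(entropy, gap, slope, ifc_gap_interaction)
```

For the toy data the state distribution is (0.5, 0.25, 0.25), so the entropy is 1.5 bits; the longest gap is 2 terms and the load slope is -1 course per term.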
Empirical analyses (Section 7.5.4) show that such interaction terms account for a substantial proportion of total model gain, indicating that multilevel thinking is not merely conceptually elegant but empirically necessary.
3.3. From risk scores to trajectory archetypes
Most early-warning systems produce individual-level risk scores (e.g., an estimated probability of dropout within the next year). While useful for prioritising outreach, these scores suffer from three limitations:
- Limited explanatory power: a high risk value rarely clarifies whether the underlying mechanism is academic friction, financial stress, social isolation, or misalignment of expectations.
- Homogenisation of heterogeneity: students with similar predicted risk may require very different forms of support.
- Weak link to practice: risk scores do not specify concrete, differentiated actions.
CAPIRE shifts focus from isolated probabilities to trajectory archetypes: empirically derived groups of students who share similar N1–N4 profiles and, consequently, similar mechanisms of vulnerability and response to support. Conceptually, CAPIRE reframes the question: from “How likely is this student to withdraw?” to “Which trajectory pattern is this student following, and what typically happens to students on this path?”
Archetypes are obtained through unsupervised learning in the VOT-compliant feature space, combining dimensionality reduction (UMAP) with density-based clustering (DBSCAN). Each archetype is then characterised along three axes:
- Structural profile: distributions of N1–N2 features (e.g., socio-economic background, age at entry).
- Curricular and friction profile: N3 patterns (e.g., high IFC concentrated in core mathematics courses).
- Temporal profile: N4 dynamics (e.g., early gaps, late deceleration, high entropy).
This structure enables: Interpretability : archetypes can be summarised in natural language as recognisable patterns (e.g., “early structural overload in gateway mathematics”). Heterogeneity-aware risk : each archetype has its own attrition rate and typical progression pattern. Actionability : archetypes map onto specific institutional levers (e.g., strengthened tutoring in particular courses, targeted financial advice, adjustments to curriculum sequencing). The term “archetype” is used in a pragmatic sense: not as an essentialist label, but as a recurring configuration that simplifies complexity without erasing relevant variation. 3.4. CAPIRE within institutional decision-making cycles CAPIRE is conceived as a sociotechnical system embedded in routine institutional processes rather than as a standalone predictive tool. Figure 3.2 summarises its role within decision-making cycles. In a typical deployment: Data integration and feature extraction : SIS and LMS data are periodically ingested, cleaned, and transformed into N1–N4 features under a configured VOT. Archetype assignment : students are assigned to archetypes based on their current feature vectors, with risk and mechanism profiles updated over time. Advisory use : academic advisors, programme coordinators, or retention committees access dashboards that display archetype distributions, key features, and historical outcomes. Intervention design : archetype profiles inform targeted actions (e.g., small-group tutoring, mentoring schemes, counselling referral paths), including the intensity, timing, and modality of support. Feedback and learning : outcomes of interventions feed back into subsequent training cycles, allowing institutions to monitor whether archetype distributions and associated risks change over time. Several design choices are essential for responsible integration: Human-in-the-loop : archetype assignments are recommendations, not prescriptions. 
Staff can override or re-interpret assignments based on qualitative knowledge that lies outside administrative data. Transparency : explanations are available at both group and individual level, enabling staff to understand why a student was mapped to a given archetype. Ethical framing : archetype labels are used internally by staff; communication with students focuses on supportive offers (e.g., “we have observed difficulties in specific courses and can provide tailored support”) rather than categorical classifications. Iterative refinement : as institutional policies, curricula, and external contexts evolve, the CAPIRE pipeline can be recalibrated and re-trained, preserving alignment with local realities. By anchoring analytics in interpretable archetypes, CAPIRE transforms predictive models into institutional learning tools, supporting a move from ad-hoc interventions to systematically designed, evidence-informed retention strategies. 4. MULTILEVEL FEATURE ENGINEERING IN CAPIRE This section translates the conceptual CAPIRE framework into a concrete feature dictionary: a set of 44 empirically tested variables spanning levels N1–N4. We describe the underlying data model, the construction logic for each feature family, and the design criteria that allow other institutions to adapt the framework to their own contexts while preserving temporal validity and interpretability. 4.1. Data model and entities CAPIRE operates on a relational data model with five core entities, common to most student information systems: STUDENT : one record per individual. ENROLMENT : one record per student–course–term combination. COURSE : curricular units with associated metadata (e.g., department, level, credits). CURRICULUM : programme-specific course sequencing and recommended load. SEMESTER/TERM : temporal index enabling alignment with macro-level indicators. In this model: A STUDENT has many ENROLMENT records. Each ENROLMENT references one COURSE . 
Each COURSE is associated with one CURRICULUM. Each ENROLMENT belongs to one SEMESTER, which is linked to calendar time (for N4 features). The outcome variable is a binary attrition flag: attrition_flag = 1 if the student leaves the programme without graduating within a predefined horizon (e.g., six years); attrition_flag = 0 if the student graduates or remains enrolled at the end of the horizon. Crucially, this outcome is never used as a feature. All predictors are computed from data available up to and including the chosen Value of Observation Time (VOT), ensuring that no post-hoc information leaks into the feature space. 4.2. N1 features: pre-entry socio-economic context Purpose. N1 features capture structural conditions that shape the resources, expectations, and constraints students bring into higher education. Typical N1 variables include: Neighbourhood deprivation (e.g., NBI_localidad) : an index derived from census data at the postcode or census-tract level, summarising poverty, overcrowding, access to basic services, and educational infrastructure. Distance to campus : geodesic or travel distance from the student’s home area to the institution, used as a proxy for commuting burden and integration costs. Local labour-market indicators at entry (desempleo_zona_t0, informalidad_zona_t0, pobreza_zona_t0) : unemployment, informality, and poverty rates for the student’s locality in the year of enrolment. Family educational capital (nivel_educ_padres, hermanos_universidad) : parental education and whether siblings have attended university. Secondary-school type (tipo_secundaria) : public vs. private vs. technical, as a coarse indicator of prior institutional context. Where necessary, N1 features are complemented by simple interaction terms (e.g., deprivation × pass rate) to capture how socio-economic context modulates academic outcomes. Missing data handling.
Census-derived variables are typically complete at area level; when gaps exist, median imputation within region/province is used. For self-reported data (e.g., parental education), CAPIRE explicitly encodes missingness with separate binary indicators to avoid silently conflating “unknown” with any substantive category. Adaptation. Outside Country Q, analogous indices (e.g., US Census tract data, Index of Multiple Deprivation in the UK) can be substituted without altering the overall N1 logic. 4.3. N2 features: entry moment characteristics Purpose. N2 features describe the student at, and immediately around, the point of first enrolment. They are conceptually and temporally anchored at \(\:{t}_{0}\) (entry). Core N2 variables include: Age at entry (edad_ingreso) : a continuous measure that differentiates traditional and non-traditional entrants. Gender and other demographic flags : used primarily for fairness monitoring and descriptive analysis. Employment status at entry (trabaja_al_ingreso) : when available, indicates potential time and cognitive constraints. Upper-secondary performance (promedio_secundaria) : high-school GPA or equivalent exam score. Macro-economic conditions at entry (IPC_interanual_t0, strikes in the 24 months prior to \(\:{t}_{0}\) ) : inflation and major disruptions to schooling, aligned to the calendar year of enrolment, not averaged across cohorts. All N2 features are computed using data that are, by definition, available at the moment of first enrolment. This ensures that the observation window is properly anchored and that N2 plays the role of initial conditions in subsequent analyses. 4.4. N3 features: academic performance and curricular friction Purpose. N3 features capture how students engage with the curriculum within the VOT window. They describe both what students have attempted and how those attempts have unfolded. 
Typical N3 variables include: Volume and outcomes of coursework : total number of courses attempted up to VOT; total passed, failed, and dropped; pass and failure rates within the window; mean and median grades, along with variability measures. Curricular friction indicators : the Instructional Friction Coefficient (IFC) at course level, defined as a weighted combination of withdrawal and failure rates; the student-level mean IFC across all attempted courses; exposure to “filter” or “gateway” subjects, i.e., courses whose IFC exceeds a pre-specified threshold. In the FACULTY B-UNIVERSITY X case study, filter courses include high-impact mathematics and physics subjects that historically concentrate failure and withdrawal. Formally, for each course \(j\), $$\text{IFC}_{j} = w_{1}\cdot\frac{\text{Dropped}_{j}}{\text{Attempted}_{j}} + w_{2}\cdot\frac{\text{Failed}_{j}}{\text{Attempted}_{j}},$$ with default weights \(w_{1}=1.0\) and \(w_{2}=0.5\), so that withdrawals are treated as a stronger signal than failures. For student \(i\), the aggregated friction is $$\text{IFC}_{\text{mean}}^{(i)} = \frac{1}{|C_{i}|}\sum_{j\in C_{i}}\text{IFC}_{j},$$ where \(C_{i}\) comprises all courses attempted by student \(i\) before the VOT cut-off. VOT compliance. All N3 features are computed from ENROLMENT records whose dates fall within the interval \([t_{0},\, t_{0}+T_{V}]\). Courses taken after this window are invisible to the feature extractor, even if they are present in the database, thereby preventing temporal leakage. 4.5. N4 features: trajectory dynamics Purpose. N4 features encode the temporal structure of each student’s progression. Rather than summarising only counts and averages, they describe how events are distributed over time.
Key N4 variables include: Average attempts per course (intentos_promedio_ventana) : distinguishing between students who typically pass on the first attempt and those who accumulate retries. Maximum gap between enrolments (gap_maximo_entre_cursadas) : the longest period without active course participation, measured in terms or semesters. Trend in course load (tendencia_carga) : the slope of a simple regression of courses-per-semester on semester index, indicating whether students accelerate, maintain, or gradually reduce load. Velocity of advance (velocidad_avance) : the ratio between completed courses and the number expected by the nominal curriculum at VOT. Regularity of progression (regularidad_cursado) : variability in the spacing of enrolments. State entropy (entropia_de_estados) : diversity in course outcomes (passed, failed, dropped, not attempted) up to VOT, computed as Shannon entropy $$H = -\sum_{s\in S} p(s)\log_{2} p(s),$$ where \(S\) is the set of states and \(p(s)\) their empirical frequencies. High entropy indicates erratic trajectories (mix of passes, failures, and withdrawals), whereas low entropy reflects consistent patterns (predominantly success, or predominantly failure/withdrawal). 4.6. Interaction features and composite indicators Educational processes are rarely additive. The impact of academic friction depends on socio-economic context; the impact of age depends on patterns of enrolment and gaps. Linear main-effects-only models systematically miss such conditional structures. CAPIRE therefore incorporates a curated set of interaction features and composites, guided by theory and validated empirically. Examples include: Friction × withdrawal rate \(\text{IFC}_{\text{mean}}^{(i)}\times\text{tasa\_libre}\) : captures compound academic risk when students repeatedly drop courses with high structural difficulty.
Age at entry × attempts \(\text{edad\_ingreso}\times\text{intentos\_promedio}\) : models the idea that older students may be less resilient to repeated failure due to higher opportunity costs. Deprivation × pass rate \(\text{NBI\_localidad}\times\text{pass ratio}\) : represents how poverty may amplify the consequences of academic setbacks. Filter exposure × maximum gap \(\text{exposicion\_filtros}\times\text{gap\_maximo}\) : reflects the vulnerability of students who both face demanding subjects and experience interruptions. These interactions are intentionally limited in number, focusing on theoretically plausible combinations, to avoid combinatorial explosion and overfitting. In our empirical models, they account for a disproportionately large share of predictive gain relative to their number, reinforcing the importance of multilevel thinking. 4.7. Feature dictionary and design criteria The complete CAPIRE feature dictionary comprises 44 variables: N1 : 12 pre-entry features (structural and socio-economic). N2 : 6 entry-moment features (demographics, preparation, macro-context). N3 : 16 curricular and performance features (including IFC-based metrics and course-specific indicators). N4 : 10 temporal and interaction features capturing dynamics and cross-level effects. Feature inclusion follows five design criteria: Temporal validity : all features must be computable using only data available at or before the VOT cut-off. Theoretical grounding : each feature must be linked to established retention or stratification theories (e.g., Tinto, Bourdieu, Astin). Actionability : features should inform potential interventions (e.g., high IFC flags subjects suitable for pedagogical redesign). Measurability : variables must be obtainable from standard institutional systems (SIS, LMS, census or official statistics). Non-redundancy : highly collinear candidates are pruned to maintain a compact, interpretable set.
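As a worked illustration of one dictionary entry, the state entropy defined in Section 4.5 can be sketched in a few lines. This is a minimal sketch and not the CAPIRE implementation: the function name, the state labels, and the two example trajectories are hypothetical.

```python
import math
from collections import Counter


def state_entropy(states):
    """Shannon entropy (base 2) of course-outcome states observed up to VOT."""
    if not states:
        return 0.0
    total = len(states)
    # p * log2(1/p) equals -p * log2(p) and avoids returning a negative zero
    return sum((n / total) * math.log2(total / n) for n in Counter(states).values())


# Hypothetical trajectories inside the VOT window
consistent = ["passed"] * 8
erratic = ["passed", "failed", "dropped", "not_attempted"] * 2

print(state_entropy(consistent))  # 0.0
print(state_entropy(erratic))     # 2.0
```

With four possible states the index ranges from 0 (a single repeated outcome) to 2 bits (all four outcomes equally frequent), which matches the reading of low versus high entropy given in Section 4.5.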
Conversely, several commonly used variables in the EDM literature are explicitly excluded when they violate these principles—most notably cumulative GPA computed post-hoc, total semesters enrolled (which is tautologically linked to attrition), or metrics requiring knowledge of end-of-trajectory outcomes. By enforcing these criteria, CAPIRE provides a feature space that is not only predictive, but also temporally honest, theoretically interpretable, and portable across institutions willing to adopt a similar multilevel, leakage-aware approach. 5. EARLY OBSERVATION WINDOW (VOT) AND DATA LEAKAGE PREVENTION 5.1. Defining the Value of Observation Time (VOT) The Value of Observation Time (VOT) is a central design parameter in CAPIRE. Intuitively, the VOT is the latest point in a student’s trajectory at which: The institution can still intervene in a meaningful way (e.g., tutoring, curricular adjustments, financial or psychosocial support), and All data used for risk profiling and archetype assignment are guaranteed to be available in a real operational setting. Formally, let \(\:t\) denote academic time measured in terms (or equivalent periods), and let \(\:T\) denote the end of the programme’s nominal duration. The VOT, \(\:{t}_{\text{VOT}}\) , satisfies: $$\:0<{t}_{\text{VOT}}<T$$; Institutional interventions launched at or shortly after \(\:{t}_{\text{VOT}}\) can plausibly affect completion; All features \(\:{X}_{\le\:{t}_{\text{VOT}}}\) are computable using information recorded on or before \(\:{t}_{\text{VOT}}\) . In many long-cycle programmes, a natural candidate for \(\:{t}_{\text{VOT}}\) is the end of the first academic year, which frequently corresponds to a peak in vulnerability and dropout. CAPIRE does not, however, hard-code this choice. 
Instead, institutions select VOT based on: empirical attrition curves (cumulative dropout by term or credit band); organisational response capacity (how quickly support services can act); curricular structure (timing of gateway or high-friction subjects). Once \(t_{\text{VOT}}\) is defined, the feature dictionary is partitioned into: VOT-admissible features : available at or before \(t_{\text{VOT}}\) and eligible for early-warning and archetype profiling; Post-VOT features : potentially useful for retrospective analyses, longitudinal research, or causal evaluation, but not for models claiming to operate at \(t_{\text{VOT}}\). This explicit temporal boundary replaces vague formulations such as “early prediction” with a precise, auditable design constraint. 5.2. Temporal slicing of trajectories and label assignment Given a chosen VOT, CAPIRE adopts a two-axis temporal scheme: A trajectory axis, along which features are accumulated up to \(t_{\text{VOT}}\); An outcome horizon, beyond \(t_{\text{VOT}}\), over which completion and dropout outcomes are defined. For each student \(i\), we construct: A feature snapshot at VOT : $$\mathbf{x}_{i}^{(\text{VOT})} = f\left(\text{N1}_{i},\ \text{N2}_{i,\le t_{\text{VOT}}},\ \text{N3}_{i,\le t_{\text{VOT}}},\ \text{N4}_{i,\le t_{\text{VOT}}}\right),$$ where \(f(\cdot)\) denotes the feature-construction rules described in Section 4. A label \(y_{i}\), defined on a later interval, for example: \(y_{i}=1\) if the student drops out at any point before \(T+\Delta\), where \(\Delta\) is a grace period; \(y_{i}=0\) if the student completes within that window; or a multi-class or time-to-event label (e.g., on-time completion, delayed completion, non-completion). By construction, no component of \(\mathbf{x}_{i}^{(\text{VOT})}\) may depend on events after \(t_{\text{VOT}}\).
This constraint is enforced at two levels: Feature engineering : all queries and transformations include explicit temporal conditions (e.g., “up to and including term \(\:{t}_{\text{VOT}}\) ”), often implemented as time-filtered views of enrolment and assessment tables. Model evaluation : data splitting respects temporal structure. Training and test sets are separated by cohort or time, and preprocessing steps (scaling, encoding, feature selection) are fitted solely on the training partition within each fold. CAPIRE supports several temporal strategies: Single-shot early warning : one snapshot per student at a specific VOT (e.g., end of year 1). Rolling-window warnings : repeated VOT snapshots (e.g., after each term), enabling dynamic monitoring of risk and possible transitions between archetypes. Retrospective trajectory analysis : full student–term sequences with labels attached at the end of the observation period, suitable for survival or transition modelling in later CAPIRE work. In all cases, the strict separation between observation window and outcome horizon provides a clear framework for reasoning about leakage, stability, and fairness. 5.3. Typical leakage scenarios in dropout prediction In the absence of explicit temporal design, leakage often enters attrition models in subtle ways. CAPIRE explicitly identifies several recurrent patterns: Outcome-proximal academic features Using end-of-year or end-of-programme indicators—such as final GPA, total failed courses, or “ever dropped” flags—as predictors in models that purport to provide early warnings. Similarly, using statistics computed over the entire trajectory (e.g., maximum consecutive inactive terms) when the prediction is supposed to occur much earlier. Temporal aggregation without windowing Constructing features such as “total enrolled terms” or “time since first enrolment” from the full record, which implicitly reveals whether the student persisted or left. 
Likewise, computing mean LMS activity across all courses ever taken and using it as an “early” predictor. Label-dependent feature construction Creating variables that directly encode or closely proxy the outcome, such as “difference between expected and realised completion time” or flags indicating that the student ceased enrolling before degree completion. Preprocessing leakage across time or folds Fitting scalers, encoders, or feature selectors on the entire dataset (including future cohorts) before splitting into training and test sets, or using target-encoding schemes that inadvertently peek at labels in the validation fold. Cohort and policy-regime leakage Mixing cohorts that experienced different policies or macro-contexts in ways that allow models to infer outcomes from regime identifiers that are not available (or stable) at deployment for new cohorts. Some of these issues (e.g., incorrect scaling procedures) can be mitigated with rigorous pipeline implementation. Others, especially those involving outcome-proximal features, must be addressed at the feature-design and temporal-modelling level. CAPIRE is explicitly built to operate at that level. 5.4. How CAPIRE’s feature engineering prevents leakage Leakage prevention is embedded in CAPIRE’s design through four complementary mechanisms: Temporal eligibility tags in the feature dictionary : Each feature is annotated as: VOT-admissible (eligible for early-warning and archetype models), post-VOT (restricted to retrospective or explanatory analyses), or restricted (requiring special justification or anonymisation). This enables independent auditing of temporal legitimacy at the feature level. Explicit VOT filters in construction rules : Feature formulas are written to include time bounds by design (e.g., “count of failed core courses up to and including term 2”, “velocity of advance at VOT”).
Implementation templates (SQL, Python, R) systematically incorporate conditions such as term <= t_VOT, reducing the risk of developers inadvertently crossing the temporal boundary. Cohort- and time-aware data splitting : For predictive tasks, CAPIRE favours cohort-based or time-based splits (training on earlier cohorts and testing on later ones) over random splits. Preprocessing (scaling, encoding, feature selection) is fitted exclusively on the training partition in each fold and then applied to validation/test sets, preventing information from flowing “backwards in time”. Design rules forbidding outcome-proximal features at VOT : For early-warning models, the framework explicitly forbids: use of any grade or course outcome recorded after \(t_{\text{VOT}}\); features aggregating over the entire enrolment history (e.g., total failed courses, total inactive terms); indicators derived from final status (graduate vs. dropout) or closely related proxies. When such variables are valuable for retrospective explanation (e.g., for case studies or causal analyses), they are computed in clearly separated post-VOT feature sets that cannot be accidentally incorporated into VOT-based models. In addition, CAPIRE encourages diagnostic checks for possible leakage, such as comparing performance under random vs. time-based splits, and benchmarking against models trained with explicitly post-VOT features. Large discrepancies in performance can signal hidden temporal leakage and trigger further inspection. Rather than treating leakage as an incidental implementation problem, CAPIRE elevates it to a first-order design constraint. 5.5.
Generalising VOT to other programmes and modalities Although the implementation discussed in this paper concerns a multi-year engineering programme, the VOT concept and associated design rules generalise to a wide range of educational settings: Short-cycle and professional programmes In two-year or shorter programmes, VOT may be defined at the end of the first major assessment block or when a given proportion of credits (e.g., 25–30%) has been attempted. Features then focus on the earliest robust indicators of friction and pacing. Modular, competency-based, and micro-credential systems Where progression is organised in modules or competencies rather than fixed terms, VOT can be defined as the point where a student accumulates a specified number of modules or attempts. Temporal slicing then operates over module sequences, and features summarise early module completion patterns, retries, and idle periods. Online, blended, and MOOC-like environments VOT may be set in terms of weeks since registration or proportion of content accessed. Leakage prevention requires excluding engagement metrics that implicitly look beyond this cut-off (e.g., final exam participation), while including early engagement signals (first-week activity, initial assessment performance). Part-time and non-traditional trajectories For heterogeneous pacing, VOT is better expressed in terms of attempted or completed credits rather than elapsed calendar time (e.g., “after the student has attempted 40 credits”). This avoids penalising slower but still viable trajectories. Cross-institutional or system-level analytics For comparative studies, VOT can be standardised in terms of relative progression (e.g., completion of the first curricular block) rather than absolute years. Each institution then maps this conceptual VOT to local structures (terms, modules). 
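The credit-based variant of VOT described for part-time and non-traditional trajectories can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the tuple layout, and the rule of keeping whole terms until the threshold is reached are hypothetical choices, not part of the framework as published.

```python
def credit_based_window(enrolments, credit_threshold=40):
    """Select one student's enrolments that fall inside a credit-based VOT window.

    `enrolments` holds (term_index, course, credits) tuples. Whole terms are kept
    until cumulative attempted credits reach the threshold (an assumed convention).
    """
    ordered = sorted(enrolments, key=lambda e: e[0])  # stable sort by term index
    window, attempted, current_term = [], 0, None
    for term, course, credits in ordered:
        # Stop at the first new term once the credit threshold has been reached
        if attempted >= credit_threshold and term != current_term:
            break
        window.append((term, course, credits))
        attempted += credits
        current_term = term
    return window


# Hypothetical students: same credits attempted, very different calendar pacing
full_time = [(1, "A", 10), (1, "B", 10), (1, "C", 10), (1, "D", 10), (2, "E", 10)]
part_time = [(1, "A", 10), (2, "B", 10), (3, "C", 10), (4, "D", 10), (5, "E", 10)]

print(credit_based_window(full_time)[-1][0])  # 1: window closes after term 1
print(credit_based_window(part_time)[-1][0])  # 4: window closes after term 4
```

Both students contribute 40 attempted credits to their feature snapshot, so the slower trajectory is not truncated earlier than the faster one.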
Across these modalities, the core logic of VOT remains unchanged: Identify a point at which intervention remains meaningful; Restrict the feature set to data legitimately available by that point; Make these restrictions explicit and auditable. By elevating VOT from an informal intuition (“early enough”) to a formal design parameter, CAPIRE provides a reusable template for building early-warning and archetype-discovery systems that are both accurate and temporally honest. 5.6. Sensitivity analysis for DBSCAN noise cases In addition to the main UMAP + DBSCAN clustering workflow, we conducted a sensitivity analysis of the cases labelled as noise (cluster = − 1) by DBSCAN. This was motivated by a known limitation of density-based clustering methods: sparse but meaningful minority structures may be incorrectly classified as noise in low-dimensional embeddings. The analysis proceeded in two stages. First, we compared outlier students with non-outlier students across a set of theoretically grounded N2–N4 indicators (e.g., age at entry, VOT-window mean IFC, maximum gap between enrolments) using descriptive statistics and non-parametric tests (Mann–Whitney U and Levene’s tests). This allowed us to assess whether the noise group exhibited high internal heterogeneity (as would be expected for genuine noise) or instead formed a coherent pattern. Second, we performed dedicated re-clustering of the outlier subset using algorithms such as k -means, hierarchical clustering, and HDBSCAN. The results (reported in detail in Section 7.4.5 ) show that the DBSCAN noise group contains at least two well-separated minority configurations with high internal cohesion, contradicting the interpretation of these cases as unstructured residuals. This sensitivity analysis strengthens the transparency and ecological validity of the clustering pipeline. 
It demonstrates that CAPIRE does not simply discard a quarter of the cohort as opaque noise but explicitly documents and interrogates the structure of these cases, responding directly to peer-review concerns about representativeness and coverage. 6. IMPLEMENTATION AND PIPELINE ARCHITECTURE CAPIRE is implemented as a modular, reproducible pipeline with strict temporal validation, designed to transform heterogeneous institutional data into feature matrices ready for topological analysis, archetype discovery, and predictive modelling. The architecture prioritises three fundamental principles: reproducibility, traceability, and absolute prevention of data leakage. The result is a system capable of operating from ad-hoc analytical environments to automated, institution-scale deployments. 6.1. Overview of CAPIRE-Core Architecture The architecture of CAPIRE-core follows a separation of responsibilities pattern, where each module operates independently and is verifiable. The pipeline is organised into four macro-layers: Configuration Layer : Centralises all analytical decisions outside the code. Defines temporal window parameters, activation of feature levels, validation rules, imputation strategies, and weights for synthetic indices. Data Ingestion & Validation : Establishes connectors capable of extracting data from SIS, LMS, administrative files, and macroeconomic sources. Each dataset undergoes structural, referential, and temporal validation before entering the pipeline. Feature Engineering Layer : Implements extractors by level (N1–N4). Each extractor applies VOT, generates derived transformations, and ensures that no attribute uses information after the temporal cutoff point. Assembly & Metadata Layer : Consolidates the final set of features, generates standardised artefacts (Parquet matrices, dictionaries, JSON sidecars), and documents each feature matrix with configuration hashes and temporal audit trails. 
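The configuration hash attached to each feature matrix can be illustrated with a short sketch. It assumes the configuration has already been parsed into a plain dictionary (for example from a YAML file); the function names, configuration keys, and sidecar fields shown here are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON form of the configuration."""
    # sort_keys and fixed separators make the serialisation independent of key order
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_sidecar(config: dict, n_features: int, n_students: int) -> dict:
    """Assemble a minimal metadata record to store next to a feature matrix."""
    return {
        "config_hash": config_hash(config),
        "n_features": n_features,
        "n_students": n_students,
        "config": config,
    }


# Hypothetical configuration, as it might look after parsing
config = {
    "vot_years": 1.5,
    "levels": {"N1": True, "N2": True, "N3": True, "N4": True},
    "ifc_weights": {"w1": 1.0, "w2": 0.5},
}
sidecar = build_sidecar(config, n_features=44, n_students=1343)
print(json.dumps(sidecar, indent=2))
```

Hashing a canonical serialisation rather than the raw file means that reordering keys or changing whitespace does not alter the identifier, whereas any change to a parameter value does.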
Architectural principles : Idempotence : Any execution with the same configuration produces exactly the same results. Modularity : The N1–N4 extractors function as decoupled blocks; institutions without census data, for example, can deactivate N1 without affecting the rest of the pipeline. Traceability : Each artefact is versioned with its complete configuration and cryptographic hash. Early validation : Errors are detected at entry, never during modelling. Figure 6.1 summarises this modular design and the connections between layers. 6.2. From Raw Data to Feature Matrices: ETL Workflow The CAPIRE ETL pipeline follows a deterministic flow, composed of four critical stages: Stage 1: Data Ingestion (Extract) Institutional systems often store information in heterogeneous schemas (SQL databases, CSV exports, Excel spreadsheets). To handle this, CAPIRE-Core includes specific connectors that: standardise field names; normalise date formats, identifiers, and postcodes; link student databases with census or macroeconomic data via geographic or temporal keys. The result is a set of normalised, consistent datasets that are comparable across cohorts. Stage 2: Preprocessing & Validation (Transform) All information undergoes strict validation of: data types (numeric, date, categorical), plausible ranges (age, grades, rates), referential integrity (every enrolment must have a valid student), temporal consistency (no event can occur before the student’s entry), completeness rules (essential fields cannot be missing). If validation fails, the pipeline halts and generates an error report. Missing data management follows differentiated strategies for each level: N1 : geographic imputation + absence indicators, N2 : mean imputation + indicators, N3 : never imputed (absence is informative), N4 : temporal interpolation where appropriate.
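The Stage 2 checks can be sketched as a small validator. This is a minimal sketch under stated assumptions rather than the CAPIRE-Core code: the record layout, the function names, and the 0–10 grading scale are hypothetical and would need to be adapted to the local system.

```python
from datetime import date

GRADE_RANGE = (0, 10)  # assumed grading scale; adjust to the local system


def validate(students, enrolments):
    """Return a list of validation errors (empty when every check passes)."""
    errors = []
    for e in enrolments:
        student = students.get(e["student_id"])
        # Referential integrity: every enrolment must point to a known student
        if student is None:
            errors.append(f"enrolment {e['id']}: unknown student {e['student_id']!r}")
            continue
        # Temporal consistency: no event may precede the student's entry
        if e["date"] < student["entry_date"]:
            errors.append(f"enrolment {e['id']}: dated before the student's entry")
        # Plausible ranges: missing grades are allowed (N3 is never imputed)
        grade = e.get("grade")
        if grade is not None and not GRADE_RANGE[0] <= grade <= GRADE_RANGE[1]:
            errors.append(f"enrolment {e['id']}: grade {grade} outside {GRADE_RANGE}")
    return errors


def validate_or_halt(students, enrolments):
    """Stop the pipeline with an error report if any check fails."""
    errors = validate(students, enrolments)
    if errors:
        raise ValueError("validation failed:\n" + "\n".join(errors))


# Hypothetical records: one valid enrolment and three deliberate violations
students = {"s1": {"entry_date": date(2010, 3, 1)}}
enrolments = [
    {"id": 1, "student_id": "s1", "date": date(2010, 4, 1), "grade": 7},
    {"id": 2, "student_id": "s1", "date": date(2009, 4, 1), "grade": None},
    {"id": 3, "student_id": "s9", "date": date(2010, 4, 1), "grade": 6},
    {"id": 4, "student_id": "s1", "date": date(2011, 4, 1), "grade": 12},
]
for message in validate(students, enrolments):
    print(message)
```

Collecting every error before halting, instead of stopping at the first one, gives analysts a complete report for a single run.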
Stage 3: Feature Engineering (Transform) Each level of the multilevel model has a dedicated extractor: N1 : socioeconomic and demographic context; N2 : self-declared and transversal attributes; N3 : academic behavioural footprints (core of the framework); N4 : macroeconomic and cohort conditions. VOT enforcement is integrated into each extractor: no feature may use data after the defined temporal window. Because the cut-off is enforced at extractor level, variables added later inherit the same restriction, which protects against temporal leakage by construction. This stage also includes calculation of explicit interactions between levels (e.g., NBI × pass rate), and derivation of synthetic indices such as the IFC. Stage 4: Feature Matrix Assembly (Load) The pipeline output is a compressed feature matrix in columnar format, accompanied by a metadata file documenting: number of features, percentage of missing values, extraction configurations, full configuration hash, exact execution timestamp, included cohorts and final sample size. This mechanism makes it possible to reproduce any historical matrix with bit-level precision. 6.3. Reproducibility and Configuration Management CAPIRE-Core requires that all analytical decisions be external to the code and audited via: YAML configuration files (define pipeline parameters), JSON validation schemas (define structural rules), SHA-256 cryptographic hashes (uniquely identify each configuration). The combination of these three elements ensures that: any pipeline can be regenerated; any analytical error can be traced to its specific configuration; and institutional teams work with a versioned, auditable, and comparable system across years. This makes CAPIRE a scientifically robust and standardised tool, aligned with the reproducibility requirements of Q1 journals. 6.4. Computational Considerations and Scalability The pipeline is designed to be efficient on modest hardware and scalable in institutional environments. Medium-sized datasets (≈ 1,300 students) are fully processed in less than a minute.
It scales almost linearly with the number of students and courses per student thanks to: batch processing, columnar reading, optional parallelisation by student, optimised IFC calculations. For large institutions (> 100,000 students), parallel processing and columnar storage are recommended. The pipeline can be integrated with distributed systems if the institution has greater infrastructure. 6.5. Deployment Modes CAPIRE-Core supports three deployment modes, according to the institution’s technological maturity: Mode 1 — Batch Processing (Entry-Level) : Analysts run the pipeline manually at the end of the semester. Ideal for planning offices with minimal infrastructure. Mode 2 — Scheduled Automation (Intermediate) : The pipeline runs on a scheduled basis (e.g., weekly), with direct access to SIS/LMS. Used for quarterly early warning systems. Mode 3 — Real-Time Integration (Advanced) : CAPIRE operates as a microservice queried by the student information system. Provides archetype, risk, and recommendation in real time when opening the student’s record. Current situation: FACULTY B–UNIVERSITY X operates in Mode 2, with migration to Mode 3 planned for 2026. 6.6. Quality Assurance and Testing Quality assurance follows a pyramidal approach: Unit tests : Verify internal calculations of extractors. Integration tests : Ensure the complete pipeline produces valid matrices. Validation tests : Confirm strict compliance with VOT. Regression tests : Compare new matrices with historical matrices to ensure reproducibility. The critical test is temporal validation: it is verified that no feature uses data after the cutoff. Tests are run automatically via continuous integration. 6.7. Software Availability and Licensing CAPIRE-core will be released as open software under the MIT licence, with: public repository, complete documentation and tutorials, synthetic dataset for validation, extensible modules for new feature types and new connectors. 
Institutional contribution is encouraged to extend the ecosystem, especially in: specific extractors for online modalities, new LMS connectors (Canvas, Blackboard), advanced behavioural features, longitudinal monitoring pipelines. 7. EMPIRICAL ILLUSTRATION: STUDENT TRAJECTORY ARCHETYPES AT UNIVERSITY X 7.1. Institutional Context and Dataset 7.1.1. Institutional Setting The empirical illustration of CAPIRE was conducted at the Facultad de Ciencias Exactas y Tecnología of Universidad Nacional de Region Z (FACULTY B-UNIVERSITY X), a public engineering school in northwest Country Q. FACULTY B-UNIVERSITY X offers six undergraduate engineering programs; this study focuses on Civil Engineering, a traditional program characterized by: Sequential prerequisites : Long chains of dependent courses in which progress in advanced subjects is strictly conditioned on completion of foundational courses. High mathematical rigor : First-year subjects such as Calculus I–III, Physics I–II and Linear Algebra act as “filter courses” with historically high failure and withdrawal rates. Socioeconomically diverse intake : Most students come from middle- and lower-income households; between 35% and 40% work while studying. Open admission : In line with many Latin American public universities, there is no entrance examination; all high-school graduates are admitted. These features make FACULTY B-UNIVERSITY X broadly representative of public engineering institutions in Latin America facing structural challenges in retention and time-to-degree (Giovagnoli, 2002; García de Fanelli, 2014). 7.1.2. Dataset Description The empirical sample comprises 1,343 Civil Engineering students from the 2004–2019 cohorts, covering 15 academic years. The analytical dataset integrates the four CAPIRE levels: N1 – Pre-entry structural context : demographic variables (age at enrolment, place of origin), postal-code–linked neighborhood deprivation indices, and local labour market indicators. 
- N2 – Entry moment: high-school GPA, employment status at enrolment, and prior educational trajectory.
- N3 – Academic performance: course enrolments, pass/fail outcomes and exam attempts during the observation window.
- N4 – Trajectory dynamics: temporal ordering of course attempts, gaps between enrolments and changes in course load over time.

The Value of Observation Time (VOT) was set to \(T_V = 1.5\) years (end of the second academic year). All features were constructed using only information available up to \(T_V\), in strict compliance with the leakage-prevention principles described earlier. Full-trajectory outcomes (attrition vs. graduation, time-to-degree) were reserved solely for ex-post evaluation and were never used in feature construction.

7.1.3. Descriptive Statistics

At \(T_V\), the average age at enrolment was 18.7 years (SD = 1.7), with women representing 18.2% of the sample and 3.9% of students reporting employment at entry. Trajectories in the first 1.5 years are already fragile: students attempt close to the nominal first-year course load, but a large fraction of attempts end in failure or "libre" (dropping the course without taking the exam). Over the full trajectory, the attrition rate reaches 56.7%, the graduation rate 14.8%, and the mean time-to-degree is 7.2 years, substantially exceeding nominal program length.

Missing data are concentrated in two blocks: (1) macro-economic indicators for rural postal codes (≈ 28% missing), and (2) grade-based metrics for students who drop all courses without sitting exams (≈ 42% missing in those variables). We combined median imputation for selected N1 features, exclusion of grade-based variables from the clustering step, and explicit missingness indicators. Missingness pattern analysis (Little's test) did not detect systematic associations between missingness and attrition, supporting the assumption that missingness does not bias the archetype discovery.

7.2.
Feature Engineering Implementation

The CAPIRE multilevel feature dictionary (Section 3) was operationalized to produce 44 features grouped across four levels.

N1 – Structural context (12 features). These variables capture socioeconomic vulnerability via a neighborhood deprivation index (NBI), local unemployment and informality rates at enrolment, and indicators of macro-economic crisis periods. Interaction terms such as NBI × pass rate link structural disadvantage to observed performance.

N2 – Entry moment (6 features). Features include age at enrolment, employment status, geographic origin (rural vs. urban; distance to campus), and temporally aligned educational and economic context (e.g., number of teacher strikes in the 24 months preceding enrolment, inflation at t₀).

N3 – Academic performance snapshot (16 features). Up to \(T_V\), we summarize the academic record using counts of failed courses, proportion of "libre" enrolments, mean and median grades, and variability of performance. A central construct is the Instructional Friction Coefficient (IFC), which quantifies course-level structural difficulty by combining failure and withdrawal rates and allows identification of institutional "chokepoint" courses.

N4 – Trajectory dynamics (10 features). These variables describe temporal patterns such as the maximum gap between consecutive enrolments, the trend in course load across semesters, the ratio of completed to expected courses at \(T_V\), several cross-level interaction terms (e.g., friction × dropout, age × re-enrolment) and an entropy-like index capturing how erratic or consistent the sequence of states (passed/failed/dropped/not attempted) is over time.

All features strictly respect VOT compliance: no variable uses information beyond 1.5 years after enrolment; macro-indicators are aligned with the year of entry; and the attrition label is never used for feature construction.
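The VOT restriction and the entropy-like index can be illustrated with a minimal Python sketch. It is not the CAPIRE-Core implementation: the record layout (`CourseEvent`), the function names, and the choice of Shannon entropy over the distribution of states are assumptions made for illustration, since the exact definition of the index is not given here.

```python
from collections import Counter
from dataclasses import dataclass
from math import log2

T_V_YEARS = 1.5  # observation window used in this section


@dataclass(frozen=True)
class CourseEvent:
    """One course attempt. Field names are hypothetical."""
    years_since_enrolment: float
    state: str  # "passed", "failed", "dropped" or "not_attempted"


def restrict_to_vot(events, t_v=T_V_YEARS):
    """Keep only events observed at or before the VOT cutoff."""
    return [e for e in events if e.years_since_enrolment <= t_v]


def assert_vot_compliant(events, t_v=T_V_YEARS):
    """Raise if any event lies beyond the cutoff (a temporal validation guard)."""
    late = [e for e in events if e.years_since_enrolment > t_v]
    if late:
        raise ValueError(f"{len(late)} event(s) fall after T_V = {t_v} years")


def state_entropy(events):
    """Shannon entropy (in bits) of the state distribution.

    Assumed stand-in for the entropy-like index: 0.0 for an empty or
    perfectly consistent record, larger for more mixed records.
    """
    counts = Counter(e.state for e in events)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# Toy record: the last event is outside the window and must not be used.
record = [
    CourseEvent(0.5, "passed"),
    CourseEvent(1.0, "failed"),
    CourseEvent(1.5, "passed"),
    CourseEvent(2.0, "dropped"),
]
window = restrict_to_vot(record)
assert_vot_compliant(window)
print(len(window), round(state_entropy(window), 3))
```

Filtering before any feature is computed, and re-checking the filtered window with a guard, means leakage would require bypassing both steps rather than forgetting one condition inside an individual feature extractor.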
Configurations are versioned so that the exact feature set can be regenerated.

7.3. Archetype Discovery Results

7.3.1. Dimensionality Reduction and Clustering

Given the 44-dimensional feature space, we first applied Uniform Manifold Approximation and Projection (UMAP) to obtain a three-dimensional embedding that preserves local structure while facilitating clustering. The resulting representation captures slightly more than half of the total variance and provides a well-separated manifold suitable for density-based clustering.

We experimented with Mapper-based TDA using multiple lenses and cover parameters, but Mapper consistently produced dozens of micro-clusters, many too small to support institutional interventions. This mismatch reflects a tension between fine-grained topological exploration and the need for a limited number of robust, interpretable types. We therefore adopted a more pragmatic strategy: DBSCAN applied directly to the UMAP embedding.

DBSCAN hyperparameters were tuned using k-distance plots and cluster validity indices. The final solution yielded 18 clusters, of which 13 met our interpretability criterion (≥ 40 students) and were retained as archetypes. Smaller clusters were merged with density-labelled noise for analysis. Overall, 847 students (63.1%) received a stable archetype label; 356 (26.5%) were classified as noise; and 140 (10.4%) belonged to small clusters merged into the residual group.

Cluster validity was acceptable for a heterogeneous educational dataset: the silhouette coefficient was 0.318, the Calinski–Harabasz index 590.4 and the Davies–Bouldin index 0.702, all consistent with well-separated yet overlapping clusters in a complex social system.

7.3.2. Archetype Characterization

Each archetype was profiled using descriptive statistics of the 44 features plus full-trajectory outcomes (attrition and graduation). Table 7.1 (not reproduced here in full) summarizes the five largest archetypes.
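The profiling step can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the study's code: the function name, the input layout (one archetype label and one 0/1 attrition flag per student) and the toy data are assumptions. Outcomes enter only here, after clustering, in line with their ex-post role.

```python
from collections import defaultdict


def profile_archetypes(labels, dropped_out, top_k=5):
    """Return (archetype_id, n_students, attrition_pct) for the largest archetypes.

    labels      -- archetype id per student (hypothetical integer ids)
    dropped_out -- 1 if the student eventually left the program, else 0
    """
    groups = defaultdict(list)
    for label, outcome in zip(labels, dropped_out):
        groups[label].append(outcome)
    rows = [
        (label, len(outcomes), round(100 * sum(outcomes) / len(outcomes), 1))
        for label, outcomes in groups.items()
    ]
    # Largest archetypes first; ties broken by id so the order is reproducible.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:top_k]


# Toy labels and outcomes, not the study data.
toy_labels = [5, 5, 5, 5, 16, 16, 16, 9, 9]
toy_outcomes = [1, 1, 1, 0, 0, 0, 1, 1, 1]
for row in profile_archetypes(toy_labels, toy_outcomes):
    print(row)
```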
Key patterns include:

- Arquetipo 5 – High-Risk: Sustained Friction. Students with high and persistent curricular friction: around three failed or dropped courses within \(T_V\), dropout rates near 75%, and IFC values among the highest across Q1–Q4. Attrition reaches 74.3%, with very low graduation. These students are structurally embedded in "chokepoint" courses and require intensive academic support.
- Arquetipo 2 – Moderate-Risk: Extra-Academic Factors. Students with relatively low friction (low "libre" proportion and moderate failure rates) but still high attrition (≈ 59%). Their trajectories suggest that withdrawal is driven less by academic failure and more by unobserved extra-academic pressures (financial stress, health, family obligations), indicating the need for counseling and social support rather than purely curricular interventions.
- Arquetipo 9 – Critical-Risk: Total Disengagement. Students whose entire first-year record consists of dropped courses (100% "libre") and virtually no exams taken. Attrition exceeds 80%. These students never establish an academic foothold and would benefit from pre-enrolment orientation, realistic expectation-setting and first-weeks intensive support.
- Arquetipo 16 – Low-Risk: Success Model. Students with consistently low friction, high pass rates, no significant gaps and early completion. Attrition is about 21% and graduation above 27%. They represent "success trajectories" and are natural candidates for peer-mentoring roles and for defining normative curricular benchmarks.
- Arquetipo 0 – Moderate-Risk: Young Strivers. The youngest group on average, with no employment at entry but high friction in early courses and elevated attrition (≈ 66%). They appear academically motivated but underprepared for the level of rigor, suggesting the value of bridge programs and explicit training in study strategies.

Table 7.1 Summary of the five largest archetypes (N = 1,343).
| Archetype ID | Archetype Label | N1–N2 Profile | N3 Friction Pattern | N4 Trajectory Pattern | Attrition Rate (%) |
| --- | --- | --- | --- | --- | --- |
| Arquetipo_5 | High-Risk: Early Performance Collapse | Lower-middle SES; standard entry; age slightly above average | Low initial performance; high rate of "libre" courses; strong dependence on foundational courses | Unstable trajectory; early repetition; persistent risk | 74.3 |
| Arquetipo_9 | Moderate-Risk: Low GPA + Course Friction | Intermediate SES; traditional entry | Low initial performance; high proportion of failed courses | Moderate oscillations; slow progression | 84.5 |
| Arquetipo_8 | Low-Middle SES + Mixed Performance | Low SES; early entry; young age | Heterogeneous performance; mix of passes and "libre" courses | Zigzagging but non-critical trajectory | 64.1 |
| Arquetipo_0 | Adult Entrants with High Friction | Age well above average; frequent employment | Low grades; difficulties in the early stages | Fragmented trajectory; recurrent interruptions | 66.1 |
| Arquetipo_11 | Moderate-Risk: High Course Load + Low Success | Middle SES; typical student | High rate of courses taken with poor performance | Slow progression with accumulating academic debt | 71–72 (based on z-score + estimate) |

A heatmap of standardized features across archetypes (Fig. 7.5) highlights sharp contrasts, for example the extreme "libre" rates of Arquetipo 9 and the low IFC and high progress indicators of Arquetipo 16.

7.3.3. Filter Subjects and Curricular Friction

At the course level, we computed the Instructional Friction Coefficient across Q1–Q4. The top ten "filter subjects" include advanced structural mechanics, hydrology, basic hydraulics, pavement design, upper-level calculus, statistics and key materials courses. Civil Engineering subjects dominate the friction ranking, with mathematics courses acting as cross-program barriers.

From an institutional perspective, this confirms that attrition is not purely idiosyncratic: specific curricular components systematically generate friction.
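A course-level friction ranking of this kind can be sketched as follows. The exact functional form of the IFC is not given in this section, so the sketch assumes the simplest combination of the two ingredients named in the text: the share of enrolments that end in failure or "libre". The `CourseTally` structure, the function names and the toy tallies are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CourseTally:
    """Aggregated outcomes for one course within the observation window."""
    course: str
    enrolments: int
    failed: int
    libre: int  # enrolments dropped without sitting the exam


def friction(tally):
    """Assumed friction score: (failed + libre) / enrolments, in [0, 1]."""
    if tally.enrolments == 0:
        return 0.0
    return (tally.failed + tally.libre) / tally.enrolments


def top_filter_subjects(tallies, k=10):
    """Rank courses from highest to lowest friction and keep the first k."""
    return sorted(tallies, key=friction, reverse=True)[:k]


# Toy tallies for illustration only; not figures from the study.
toy = [
    CourseTally("course_a", enrolments=200, failed=60, libre=70),
    CourseTally("course_b", enrolments=150, failed=15, libre=15),
    CourseTally("course_c", enrolments=120, failed=48, libre=24),
]
for t in top_filter_subjects(toy, k=2):
    print(t.course, round(friction(t), 2))
```

A weighted combination, or a version normalised by semester, would fit the same interface; only `friction` would change.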
CAPIRE provides a quantitative map of those bottlenecks, which can be used to prioritize pedagogical redesign (e.g., active learning, peer-assisted instruction, changes in prerequisite structures).

7.4. Archetype Validation

To ensure that the 13 archetypes represent genuine and robust patterns, we conducted several complementary validation analyses.

7.4.1. Bootstrap Stability

Using 100 bootstrap resamples of the original dataset, we re-estimated the full UMAP + DBSCAN pipeline and compared cluster assignments via the Adjusted Rand Index (ARI). The mean ARI was 0.614 (SD = 0.081; 95% CI [0.444, 0.780]), indicating substantial stability in cluster structure despite sampling variability, which is notable given the heterogeneity typical of student trajectories.

7.4.2. Permutation Significance Test

To test whether the observed clustering outperforms random partitions, we built a null distribution of silhouette scores from 100 random permutations of cluster labels. The real silhouette score (0.318) was far above the null mean (−0.122), with an empirical p-value of 0.0099. Thus, the observed clusters are highly unlikely to arise by chance (p < 0.01).

7.4.3. Temporal Validation Across Cohorts

We assessed temporal stability by splitting the sample into two independent periods (2004–2010 and 2011–2019), projecting both through the same UMAP embedding and comparing archetype distributions and attrition rates. Differences in attrition per archetype were consistently below 5 percentage points. The overall attrition rate decreased modestly in later cohorts, likely reflecting institutional policies, but the relative profiles and risks of each archetype remained stable. This supports the interpretation of archetypes as persistent structural patterns rather than cohort-specific artefacts.

7.4.4. Sensitivity to Hyperparameters

We explored the sensitivity of the archetypes to variations in UMAP and DBSCAN hyperparameters via a small grid of alternative configurations.
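Two quantities recur in these validation analyses: the ARI between two partitions (bootstrap and sensitivity comparisons) and the empirical p-value of the permutation test. The sketch below is a dependency-free Python illustration of both. The ARI follows the standard pair-counting definition; the "+1" correction in the p-value is an assumption about the estimator used, not something stated in the text.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same number of items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items placed together in both, in A only, and in B only.
    together_both = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    together_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed one.

    The +1 in numerator and denominator counts the observed score as part of
    the reference set, so the estimate can never be exactly zero.
    """
    at_least = sum(1 for score in null_scores if score >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


# Relabelling clusters does not change the ARI.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Under this estimator, 100 permutations with no null score reaching the observed silhouette give 1/101 ≈ 0.0099, which coincides with the reported p-value.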
Across 27 combinations, the ARI relative to the reference clustering averaged 0.74, with a minimum of 0.62 and a maximum of 0.89. This indicates that archetype structure is robust to reasonable changes in modelling choices and is not an artefact of a particular parameter setting.

7.4.5. Analysis of DBSCAN "Noise"

Because DBSCAN labels a substantial fraction of students (26.5%) as noise, we analysed this group separately. Compared with clustered students, outliers had almost identical mean age and friction but shorter gaps between enrolments and lower variance in the analysed variables. Non-parametric tests (Mann–Whitney and Levene) confirmed statistically significant differences in distributions and lower dispersion among outliers.

Re-clustering only the outliers revealed at least two clearly separated micro-archetypes with high silhouette scores, and additional smaller groups under HDBSCAN. This suggests that the "noise" does not constitute random chaos but rather cohesive minority trajectories that are not dense enough to form DBSCAN clusters. These residual structures deserve explicit modelling in future work.

7.5. Predictive Performance: Early-Warning System

7.5.1. Model Development

To translate archetypes into an operational early-warning system, we trained a multiclass classifier to predict archetype membership at \(T_V = 1.5\) years. The model uses only the feature set available at \(T_V\), with 13 classes (one per valid archetype) and 847 labelled students (those assigned to archetypes). Outliers were excluded from training to avoid conflating majority patterns with minority residuals.

A Random Forest model, tuned via stratified cross-validation, provided the best balance between accuracy and interpretability. The train–test split (70/30) preserved the proportion of each archetype.

7.5.2.
Overall Performance

On the held-out test set, the model achieved:

- Accuracy: 94.9% (95.7% on training; 94.1% ± 1.4% in cross-validation)
- Macro F1-score: 0.948
- Small train–test gap, indicating minimal overfitting

Compared to baselines, performance is substantially higher: the majority-class baseline would reach only 8.1% accuracy, and random assignment ≈ 7.7%. The CAPIRE-based classifier thus improves predictive power by more than an order of magnitude, using only information available within the first 1.5 years, well before the average dropout time of 2.8 years.

7.5.3. Per-Archetype Performance

Per-class F1-scores are uniformly high. High-risk archetypes (e.g., Arquetipo 1, 5 and 9) achieve F1 > 0.95, enabling reliable targeting of the most vulnerable students. The "success model" archetype (16) is also classified with perfect or near-perfect accuracy, making it feasible to systematically recruit exemplary students as mentors.

Moderately risky archetypes show slightly lower but still strong performance (F1 ≈ 0.88–0.90), with confusions primarily between adjacent risk profiles rather than between high- and low-risk groups. No archetype falls below F1 = 0.70.

7.5.4. Feature Importance

An analysis of feature importance confirms the multilevel nature of attrition mechanisms. The most predictive variables are:

- cross-level interactions such as curricular friction × dropout rate and age × re-enrolment attempts;
- friction metrics in foundational courses;
- the proportion of dropped courses;
- trajectory-level indicators such as entropy of states, re-enrolment frequency and maximum gaps.

Notably, purely structural N1 variables rarely appear among the top predictors. Their effect appears to be mediated through N2–N4 variables (e.g., socioeconomic disadvantage → need to work → higher "libre" rates and gaps). This aligns with the CAPIRE hypothesis that structural vulnerability operates through behavioural and temporal mechanisms rather than as a direct determinant.

7.5.5.
Deployment and Impact Projection

The trained classifier can be deployed as a back-end service in the institutional information system, assigning archetypes to students as soon as they reach \(T_V\) and triggering pre-defined, archetype-specific interventions (Section 7.6).

A simple cost–benefit projection, assuming modest reductions in attrition (10–15 percentage points) for the most critical archetypes, suggests that targeted interventions could retain around 20 additional students per year. Over five years, this corresponds to roughly 100 additional graduates, increasing the overall graduation rate by approximately 13% relative to the baseline. Even under conservative assumptions about intervention costs and retained tuition, the net financial impact is positive, aside from reputational and social benefits.

7.6. Institutional Interpretability and Actionable Insights

7.6.1. Representative Case Studies

To bridge statistical results with institutional experience, we constructed de-identified case vignettes for selected archetypes.

A student in Arquetipo 5 exhibits repeated failures and withdrawals in filter subjects (Calculus, Physics), maintains enrolment for several semesters and then drops out. The profile combines moderate socioeconomic stress, part-time work and structurally high friction, pointing to early tutoring plus financial aid as plausible interventions.

A student in Arquetipo 16 progresses linearly, passes all foundational courses on first attempt and graduates within six years. This trajectory exemplifies a success model, suggesting that such students can be systematically recruited as peer mentors and that their strategies can inform institutional best-practice guidelines.

A student in Arquetipo 2 shows acceptable academic performance but withdraws following a family health crisis. Here, the data reveal a missed opportunity: the student was viable academically but lacked support in coping with life events.
This points to the need for proactive counseling and emergency aid linked to sudden gaps in enrolment.

These vignettes were presented to academic advisors and department heads, who consistently recognized the profiles and associated them with familiar categories ("chronic repeaters", "good students lost to personal issues", "exemplary students"). No archetype contradicted institutional experience, suggesting that CAPIRE's data-driven segmentation is ecologically valid and complements practitioner knowledge.

7.6.2. Archetype-Specific Interventions

Building on archetype profiles, we elaborated an intervention matrix that links each archetype to a priority level, dominant vulnerability and recommended institutional response. Critical-risk groups (e.g., Arquetipos 1, 5, 9) are associated with intensive tutoring in filter subjects, program redesign in high-friction courses and structured bridge programs. Moderate-risk groups (e.g., Arquetipos 0, 2) call for mentoring, study-skills training and strengthened psychosocial and financial support. Low-risk archetypes are not intervention targets but rather strategic resources (mentors, role models, benchmarks).

The matrix also provides a staged implementation roadmap, beginning with pilots for a single archetype and progressively expanding towards full integration of CAPIRE in academic advising and institutional planning.

7.6.3. Alignment with Institutional Knowledge

Qualitative feedback from 8 academic advisors and 3 department heads confirmed strong alignment between archetypes and existing informal categories used in advising. Interestingly, staff tended to overestimate the importance of pre-entry factors (N2) and underestimate trajectory dynamics (N4), illustrating common attribution biases: human observers focus on stable traits and neglect temporal processes. CAPIRE thus serves not only as a prediction tool but also as a conceptual reframing device, making dynamic mechanisms visible to institutional actors.

7.7.
Discussion: Lessons from UNIVERSITY X Implementation

7.7.1. CAPIRE Framework Validation

The FACULTY B-UNIVERSITY X case demonstrates that CAPIRE can:

- enforce strict temporal validity (VOT) and eliminate data leakage;
- discover a manageable set of interpretable archetypes recognized by practitioners;
- achieve statistically robust clustering (bootstrap and permutation tests);
- support highly accurate early prediction of archetype membership;
- translate predictions into a differentiated intervention portfolio;
- integrate multilevel features (N1–N4) in a single explanatory framework.

7.7.2. Methodological Innovations Confirmed

Three methodological choices are particularly reinforced:

- UMAP + DBSCAN vs. Mapper for archetype discovery: While Mapper TDA is valuable for exploratory topology, the combination of UMAP and density-based clustering proved better suited for obtaining a small number of robust, institutionally actionable archetypes.
- Multilevel feature engineering and interactions: Cross-level interaction terms (N3×N4, N2×N4) contributed disproportionately to predictive performance, empirically supporting the CAPIRE view that outcomes emerge from interactions across levels rather than from isolated variables.
- VOT-based leakage control: Setting \(T_V = 1.5\) years struck a practical balance: the classifier achieved almost 95% accuracy while preserving a lead time of roughly 1.3 years before the typical dropout event, making early intervention realistically feasible.

7.7.3. Limitations and Threats to Validity

The study faces several limitations:

Internal validity. Students who drop out before \(T_V\) cannot be fully observed; although sensitivity analyses with shorter VOT windows yield similar archetypes, some selection bias may remain. Self-reported data (e.g., work status) may underestimate informal employment.

External validity. Results come from a single public engineering school in Country Q.
Archetype structure might differ in private universities, non-STEM programs or other national systems, especially in more stable macro-economic contexts.

Construct validity. Archetype labels ("high-risk", "success model") are heuristic and probabilistic; boundaries are fuzzy. The interpretation of curricular friction assumes that dropped courses mark structural barriers, although strategic withdrawals may also occur.

Statistical conclusion validity. Multiple comparisons across features and archetypes increase the risk of false positives; however, the main conclusions rely on effect sizes, stability metrics and permutation tests rather than isolated p-values. Students labelled as DBSCAN noise represent a non-negligible minority whose trajectories require more refined modelling.

7.7.4. Comparison with Prior Attrition Models

Compared with traditional regression-based approaches and more recent deep-learning models, the CAPIRE implementation at FACULTY B-UNIVERSITY X offers a distinct combination of properties: it relies solely on administrative data (no surveys), increasing scalability; it reaches higher or comparable predictive accuracy while maintaining explainability; it enforces temporal validity, an often neglected aspect in education data mining; and it links predictions to explicit archetypes and intervention strategies, closing the loop between analytics and policy.

In this sense, CAPIRE sits between theory-heavy but operationally vague models (e.g., Tinto's integration framework) and highly predictive but opaque "black-box" models, providing a middle path of mechanistic, actionable explainability.

7.7.5. Practical Implications

For university leadership, the results underscore the importance of investing in longitudinal data infrastructure and in differentiated support strategies aligned with archetype profiles. For researchers, they highlight the need to integrate multilevel feature engineering, topological tools and strict temporal validation.
For policymakers, the findings emphasize that attrition is structurally heterogeneous and that segment-specific interventions are more efficient than uniform policies.

7.8. Conclusion of the Empirical Illustration

The FACULTY B-UNIVERSITY X case study shows that CAPIRE can transform conventional administrative data into a coherent, multilevel map of student trajectories. The 13 archetypes identified capture 63.1% of students, remain stable across cohorts, are statistically robust and are recognised by institutional stakeholders. A leakage-aware classifier can assign students to archetypes with high accuracy at 1.5 years, providing a generous window for targeted intervention.

This empirical illustration validates CAPIRE not only as a conceptual framework but as an operational blueprint for data-driven retention policies. The next section (Section 8) situates these findings within broader educational theory and discusses how the CAPIRE approach can be generalized and scaled to other institutional and national contexts.

8. DISCUSSION

The empirical validation at FACULTY B-UNIVERSITY X (Section 7) shows that CAPIRE fulfils its foundational goals: leakage-free feature engineering, interpretable trajectory archetypes, and accurate early-warning predictions. In this section, we synthesize the main theoretical contributions, position CAPIRE vis-à-vis alternative approaches, and discuss implications for institutional practice, portability, and ethics.

8.1. Multilevel Feature Engineering: Theoretical and Empirical Validation

8.1.1. Interaction Effects as Primary Drivers

A central claim of CAPIRE is that educational outcomes emerge from cross-level interactions, not from isolated main effects. The FACULTY B-UNIVERSITY X results support this claim: interaction features represent a minority of the feature set yet account for a disproportionate share of predictive importance.
Interactions such as curricular friction × dropout behaviour (e.g., IFC × proportion of "libre" courses) and age at entry × number of retries encode person–context fit: the same institutional conditions (e.g., high-friction courses) have different consequences for older students with family or work responsibilities than for younger students with fewer constraints.

This pattern aligns with life course theory (Elder, 1998) and ecological systems theory (Bronfenbrenner, 1979), both of which emphasize that development reflects the alignment between individual characteristics and layered contextual demands.

Methodologically, this has two consequences:

- Models that ignore interactions (e.g., simple logistic regression without interaction terms) are structurally underpowered.
- Pre-computing theoretically motivated interactions, rather than relying solely on tree-based models to discover them implicitly, improves interpretability: features such as age × retries have a clear narrative interpretation that advisors can understand and use.

8.1.2. Trajectory Dynamics Rival Snapshot Performance

Traditional early-warning systems often rely on static indicators such as GPA at a particular semester. CAPIRE adds N4 trajectory features that capture how students move through the curriculum: gaps, re-enrolments, entropy of states, and velocity of progress.

Empirically, N4 features contribute nearly as much predictive power as N3 performance snapshots. Two students with similar GPA at \(T_V\) can belong to very different archetypes: one with linear, gap-free progress and another with repeated enrolments, mixed outcomes and long interruptions. The latter is far more likely to drop out, even if grades at a given point are comparable.

This supports longitudinal perspectives (Singer & Willett, 2003) and shows that patterns over time contain crucial information beyond static performance.
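The contrast between two students with the same snapshot but different dynamics can be made concrete with a short Python sketch. The feature definitions, function names and toy records below are assumptions chosen for illustration; they are not the CAPIRE feature dictionary.

```python
from collections import Counter


def max_enrolment_gap(semesters):
    """Largest run of skipped semesters between consecutive enrolments.

    `semesters` holds the (hypothetical) 1-based indices of semesters in
    which the student enrolled in at least one course.
    """
    ordered = sorted(set(semesters))
    return max((b - a - 1 for a, b in zip(ordered, ordered[1:])), default=0)


def reenrolment_count(course_attempts):
    """Number of repeat attempts, i.e. attempts beyond the first per course."""
    return sum(n - 1 for n in Counter(course_attempts).values())


def age_x_retries(age_at_entry, retries):
    """Pre-computed cross-level interaction term (entry age x retries)."""
    return age_at_entry * retries


# Two toy students with the same GPA snapshot but different trajectories.
student_a = {"age": 18, "gpa": 6.0, "semesters": [1, 2, 3],
             "attempts": ["calculus_1", "physics_1", "algebra"]}
student_b = {"age": 27, "gpa": 6.0, "semesters": [1, 3],
             "attempts": ["calculus_1", "calculus_1", "physics_1"]}

for s in (student_a, student_b):
    retries = reenrolment_count(s["attempts"])
    print(s["gpa"], max_enrolment_gap(s["semesters"]), retries,
          age_x_retries(s["age"], retries))
```

Computing the interaction explicitly keeps it available as a named column that an advisor can read, rather than leaving it implicit in the splits of a tree ensemble.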
For practice, it implies that advisors should pay attention to how students progress, not just to what their current grades are.

8.1.3. Socioeconomic Context Operates Indirectly

Despite the strong literature on socioeconomic barriers to persistence (Bourdieu, 1986; Lareau, 2011), N1 structural features have low direct importance in the predictive model. This does not refute socioeconomic theories; instead, it suggests an indirect, mediated role.

High neighborhood deprivation (NBI) is associated with a higher probability of working while studying, which in turn is associated with a higher proportion of dropped courses and greater gaps in enrolment. These downstream N2–N4 variables, not N1 alone, are what directly drive attrition in the model.

In causal terms, N1 functions as a distal determinant, shaping exposure to risk mechanisms further down the trajectory. Removing N1 from the feature set reduces overall performance, but its contribution is mostly channeled through mediating features rather than appearing as a top-ranked predictor on its own.

For policy, this reinforces the idea that structural interventions (e.g., financial support that reduces the need to work long hours) are complementary to academic interventions: they operate upstream in the causal chain.

8.2. Advantages Over Black-Box and Theory-Free Approaches

8.2.1. Interpretability, Trust, and Institutional Adoption

Compared with black-box models such as deep neural networks (Hu & Rangwala, 2020), CAPIRE offers a combination of high predictive accuracy and high interpretability. Feature importance analyses identify a small set of conceptually clear variables and interactions that explain most of the model's performance. Archetypes themselves provide a human-readable typology of student trajectories.
Qualitative feedback from advisors at FACULTY B-UNIVERSITY X confirms that archetypes match their tacit categories (e.g., "chronic repeaters", "good but overwhelmed students", "exemplary trajectories"), which increases trust and willingness to use the system. This contrasts with previous pilots using opaque models, which advisors found difficult to interpret and, consequently, to act upon.

Interpretability is not a cosmetic advantage. In high-stakes settings such as academic progression, institutional actors must be able to explain and justify decisions. Archetypes and their defining features offer precisely that: a language that bridges statistical output and pedagogical action.

8.2.2. Theory-Driven Feature Engineering vs. Purely Data-Driven Selection

Many educational data mining (EDM) studies start from hundreds of candidate variables and rely on automated selection. CAPIRE follows the opposite path: it starts from a constrained, theory-driven feature dictionary anchored in multilevel models of student persistence.

The FACULTY B-UNIVERSITY X results show that a relatively compact, theoretically guided set of 44 features can match or surpass the performance of broader, theory-free feature sets reported in the literature. This has three advantages:

- Transferability: Features defined in terms of concepts like structural vulnerability, friction, and trajectory dynamics can be re-instantiated across institutions and countries, whereas highly specific behavioural traces (e.g., click patterns in a particular learning platform) are often not portable.
- Stability: A theory-driven dictionary changes slowly; in contrast, data-driven feature sets can fluctuate from cohort to cohort, creating confusion and undermining institutional memory.
- Protection against spurious correlations: By constraining the design space to theoretically plausible mechanisms, CAPIRE reduces the risk of learning artefacts that are predictive in one context but meaningless or unfair in another.

This does not mean that exploratory, data-driven discovery is useless. Rather, for operational early-warning systems, theory-driven feature engineering offers a more stable and ethically defensible foundation.

8.3. Implications for Early-Warning Systems and Targeted Interventions

8.3.1. Lead Time and Proactive Support

Setting \(T_V = 1.5\) years provides a substantial lead time between reliable risk identification and the typical dropout event (around 2.8 years after enrolment in our sample). This means that the system flags students when there is still a realistic window to implement meaningful support.

This contrasts with reactive approaches that trigger alerts only after repeated failure or near-irreversible disengagement. By incorporating trajectory dynamics and friction metrics early, CAPIRE allows institutions to move from "late diagnosis" to proactive care.

The sensitivity analysis using alternative VOTs suggests that \(T_V = 1.5\) years offers a good compromise: signals are strong enough for high predictive accuracy, while the intervention window remains sufficiently wide.

8.3.2. Archetype-Based Interventions Rather Than One-Size-Fits-All

Traditional risk scores compress heterogeneous trajectories into a single number, often routing all "high-risk" students into a generic intervention. CAPIRE, by contrast, distinguishes qualitatively different risk profiles:

- High-friction archetypes (e.g., Arquetipo 5) require intensive academic support in filter courses.
- Extra-academic risk archetypes (e.g., Arquetipo 2) call for counseling, social support, and flexible policies.
- Total disengagement archetypes (e.g., Arquetipo 9) point to the need for strengthened onboarding and bridge programs.
Treating these groups as equivalent would blur specific needs and dilute the impact of interventions. Archetype-based design enables differentiated, targeted strategies , and it also clarifies which combinations of mechanisms are being addressed (e.g., friction, economic stress, trajectory instability). 8.3.3. Understanding the DBSCAN Outlier Group The analysis of students labelled as noise by DBSCAN reveals that they form a coherent minority pattern rather than random irregularities. Their trajectories tend to be continuous and stable, with small gaps and low variance in key indicators, even if they do not conform to the density structure of the main archetypes in the UMAP space. Subsequent re-clustering identified at least two sharply separated micro-archetypes within this group. This suggests that density-based clustering, while effective for discovering dominant patterns, can leave minority but meaningful trajectories at the margins. For CAPIRE, the outlier group is thus best understood as a documented residual population whose structure motivates further methodological work (e.g., hybrid clustering strategies) rather than as noise to be ignored. 8.4. Relationship with Causal Inference Although this article focuses on prediction and segmentation, CAPIRE is designed to facilitate causal inference in future studies. Two properties are particularly important: Temporal validity through VOT. Because all features are constructed using information available at or before \(\:{T}_{V}\) , they are suitable for defining pre-treatment covariates in quasi-experimental designs. This is essential for methods such as propensity score matching, regression discontinuity, or difference-in-differences, where post-treatment information would invalidate identification assumptions. Rich, multilevel covariate structure. 
The N1–N4 dictionary provides a nuanced set of confounders and mediators relevant to treatment assignment (e.g., who receives tutoring, financial aid, or counseling) and to outcomes. This increases the plausibility of conditional ignorability assumptions in observational studies. In practical terms, CAPIRE can serve as the data backbone for evaluating the impact of specific institutional policies: once archetype-based interventions are implemented, researchers can exploit the existing feature infrastructure to design rigorous causal evaluations of those interventions. 8.5. Portability and Generalization CAPIRE’s multilevel taxonomy is conceptually general: structural context (N1), entry moment (N2), performance snapshots (N3) and trajectory dynamics (N4) are relevant in community colleges, research universities, online programs and graduate schools, although their operationalization will differ. Adapting CAPIRE to new contexts primarily involves: mapping local structural indicators (e.g., census measures, financial aid schemes) to N1; encoding program-specific entry features (e.g., admission pathways, prior certifications) in N2; re-computing friction metrics (IFC) for the relevant set of courses in N3; preserving the generic logic of gaps, entropy and velocity in N4. We expect N1–N2 features to vary considerably across systems, while N3–N4 patterns (friction, progression, instability) will be more stable. Archetype counts and specific profiles will likely change, but the general finding that interaction effects and trajectory dynamics matter should remain robust. Nonetheless, the current study is based on a single public engineering institution in Country Q. Replication in private, non-STEM and international settings is necessary to fully assess external validity. 8.6. Ethical Considerations and Potential Harms 8.6.1. Fairness and Bias Predictive systems can inadvertently encode and reproduce historical inequities.
CAPIRE addresses this risk in several ways: It avoids direct use of sensitive attributes such as race or religion; N1 structural indicators are area-level rather than individual-level. Fairness audits (not detailed here) suggest that archetype assignment and predictive errors do not differ substantially by gender or rural/urban origin. The system is explicitly human-in-the-loop : archetype labels are recommendations for advisors, not automatic decisions. Residual risks remain: structural variables may correlate with unobserved forms of discrimination, and targeted interventions could unintentionally overlook disadvantaged students who do not fit N1 criteria. Institutions using CAPIRE should therefore perform regular fairness audits and adjust policies if systematic disparities appear. 8.6.2. Stigmatization and Labelling Assigning students to “high-risk” archetypes carries the danger of stigmatization and self-fulfilling prophecies. CAPIRE mitigates this by: restricting archetype labels to internal use (students are not told their archetype); re-estimating archetypes periodically so that labels can change as trajectories change; framing interventions in terms of support and opportunity rather than deficit (“We see you are facing challenges in math; here is a support program”), and by also recognizing resilience indicators in high-risk archetypes. The ethical stance is that analytics should expand students’ options, not constrain them. 8.6.3. Resource Allocation Archetype-based targeting inevitably shapes how institutional resources are distributed. While this can increase effectiveness, it also raises questions about opportunity costs and the treatment of moderate-risk students. CAPIRE is not a replacement for universal support systems; it is a mechanism for prioritising additional, specialized interventions. 
Institutions must monitor whether certain groups are systematically excluded from support and ensure that targeting does not become a justification for reducing baseline services. 8.7. Limitations of the CAPIRE Framework Several limitations qualify the findings and suggest directions for improvement: Data and infrastructure requirements. CAPIRE assumes reasonably complete, longitudinal administrative data and the capacity to link external sources. Under-resourced institutions may need simplified variants (e.g., omitting N1) or staged implementations. Dynamic environments. Archetypes are estimated on historical data and may drift as curricula, policies or student populations change. Periodic re-estimation and monitoring of archetype distributions are necessary to detect and adapt to such shifts. Correlation vs. causation. The present study is predictive and descriptive. While it highlights plausible mechanisms (e.g., friction, work-study balance, temporal instability), it does not by itself establish causal effects. Interventions inspired by CAPIRE should be rigorously evaluated, ideally with quasi-experimental or experimental designs. Despite these limitations, the FACULTY B-UNIVERSITY X implementation suggests that CAPIRE provides a coherent, leakage-aware and operationally usable framework for understanding and acting upon student attrition. It offers a middle ground between purely theoretical models and purely predictive black boxes, and it lays the groundwork for future causal and comparative research. 9. CONCLUSION AND FUTURE WORK 9.1. Summary of Contributions This paper introduced CAPIRE (Comprehensive Analytics Platform for Institutional Retention Engineering), a multilevel, leakage-aware framework for student attrition modeling. We operationalized CAPIRE through an empirical study at Universidad Nacional de Region Z, Facultad de Ciencias Exactas y Tecnología (FACULTY B-UNIVERSITY X), analyzing 1,343 engineering students across 15 cohorts (2004–2019). 
The main contributions are: C1: Multilevel Feature Taxonomy (N1–N4) We proposed a theoretically grounded feature dictionary with 44 variables organized into four levels: N1 – Pre-entry structural context: neighborhood deprivation, proxies of family capital, local labor-market indicators. N2 – Entry moment: age at enrolment, employment status, macro-economic context at \(t_0\). N3 – Academic performance and curricular friction: grades, course failures, drop (“libre”) patterns, and Instructional Friction Coefficients (IFC). N4 – Trajectory dynamics: gaps between enrolments, state entropy, retries, and velocity of curricular advance. Empirically, interaction terms (e.g., N3×N4, N2×N4) represent a minority of features but account for a substantial share of predictive importance, confirming that outcomes emerge from cross-level interplay rather than additive main effects. This gives an operational form to multilevel theories of persistence (Bronfenbrenner, 1979; Pascarella & Terenzini, 2005). C2: Vulnerability Observation Time (VOT) and Leakage Prevention We formalized VOT as a temporal boundary for feature construction, enforcing the use of only pre-\(T_V\) information. In the FACULTY B-UNIVERSITY X case, \(T_V = 1.5\) years (end of the second academic year) provides: a strict barrier against future-information leakage; a 1.3-year lead time before the average dropout event (2.8 years after enrolment); a reproducible configuration regime (versioned YAML configurations, code-level checks on cutoff dates). This directly addresses the pervasive leakage problem in educational data mining, where performance is often overestimated by incorporating post-outcome data into features. C3: Empirical Validation via Trajectory Archetypes Using UMAP for dimensionality reduction and DBSCAN for density-based clustering on VOT-compliant features, we identified 13 trajectory archetypes that cover 63.1% of the student population.
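To make "VOT-compliant" concrete, the kind of code-level cutoff check mentioned under C2 might look like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the event format (a date paired with a payload), the 365.25-day year, and the function names `vot_cutoff`, `restrict_to_vot`, `assert_vot_compliant` and `lead_time` are all hypothetical. Only \(T_V = 1.5\) years and the 2.8-year mean dropout time come from the text.

```python
from datetime import date, timedelta

T_V_YEARS = 1.5  # VOT reported for the case study


def vot_cutoff(enrolment: date, t_v_years: float = T_V_YEARS) -> date:
    """Last day whose records may enter the feature matrix (365.25-day years assumed)."""
    return enrolment + timedelta(days=round(t_v_years * 365.25))


def restrict_to_vot(events, enrolment, t_v_years=T_V_YEARS):
    """Keep only (event_date, payload) records dated at or before the cutoff."""
    cutoff = vot_cutoff(enrolment, t_v_years)
    return [(d, payload) for d, payload in events if d <= cutoff]


def assert_vot_compliant(events, cutoff):
    """Fail loudly if any record is dated after the cutoff (a leakage guard)."""
    late = [d for d, _ in events if d > cutoff]
    if late:
        raise ValueError(f"{len(late)} record(s) dated after VOT cutoff {cutoff}")


def lead_time(mean_dropout_years: float, t_v_years: float = T_V_YEARS) -> float:
    """Years between the observation cutoff and the average dropout event."""
    return round(mean_dropout_years - t_v_years, 2)


if __name__ == "__main__":
    enrolment = date(2010, 3, 1)  # hypothetical enrolment date
    events = [
        (date(2010, 7, 1), "exam"),
        (date(2011, 6, 1), "exam"),
        (date(2012, 1, 1), "exam"),  # after the cutoff, must be excluded
    ]
    kept = restrict_to_vot(events, enrolment)
    assert_vot_compliant(kept, vot_cutoff(enrolment))
    print(len(kept), "records kept; lead time:", lead_time(2.8), "years")
```

Under these assumptions, only features built from records that pass such a check would feed the UMAP and DBSCAN steps that produce the archetypes.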
These archetypes are: Statistically robust : bootstrap stability (mean ARI = 0.614), permutation tests (p < 0.01), and robustness to hyperparameter changes. Temporally stable : cross-cohort comparison (2004–2010 vs. 2011–2019) shows attrition-rate differences under 5 percentage points for major archetypes. Predictively usable : a Random Forest classifier achieves 94.9% test accuracy in archetype assignment, with all archetypes reaching F1 ≥ 0.70 and several high-risk types exhibiting near-perfect classification. Qualitative validation with academic advisors shows that archetypes align with existing practitioner categories (e.g., “chronic repeaters”, “good but overwhelmed students”), bridging statistical structure and institutional knowledge. C4: Actionable Intervention Matrix We translated archetypes into differentiated intervention recommendations (e.g., intensive tutoring for friction-driven archetypes, counseling and support for extra-academic risk archetypes, enhanced onboarding for disengagement profiles). Rather than a single “high-risk” group, CAPIRE provides a matrix of risk mechanisms × intervention types , allowing institutions to design targeted, mechanism-aware responses instead of one-size-fits-all programs. 9.2. Methodological and Theoretical Advances 9.2.1. Resolving the Interpretability–Accuracy Trade-off CAPIRE shows that high predictive performance does not require black-box models. By combining: a theory-driven feature dictionary (N1–N4, including key interactions); a leakage-aware temporal design (VOT); and a transparent classifier (Random Forest with feature importance and archetype profiles), we obtain accuracy comparable to or exceeding deep learning approaches reported in the literature, while retaining clear interpretability. The usual trade-off between “explainable but weak” and “powerful but opaque” is weakened: much of the gain comes from better features and temporal design, not from more complex algorithms. 9.2.2. 
Archetypes as a Middle Ground Between Risk Scores and Case Narratives CAPIRE’s archetypes sit between individual case studies and generic risk scores: they are quantitatively derived from high-dimensional data; they remain qualitatively interpretable , with recognizable narratives (“young strivers”, “persistent friction”, “total disengagement”, “success models”); they are scalable , as a trained classifier can assign students to archetypes in real time. This reconciles person-centred and variable-centred traditions: institutions retain the richness of narrative categories while gaining the scalability and reproducibility of formal models. 9.3. Practical Implications for Institutions 9.3.1. CAPIRE as Institutional Analytics Infrastructure CAPIRE should be understood as an analytics infrastructure , not as a one-off model. Its components are reusable: The feature dictionary can be adapted to other programs and institutions, preserving the N1–N4 logic while changing local indicators. The VOT principle generalizes to other predictive tasks (course failure, time-to-degree, progression bottlenecks). The pipeline architecture supports multiple downstream uses: archetype discovery, predictive modeling, and, in future work, causal evaluation of interventions. Because the core is theory-based, it is more stable than ad hoc feature sets: institutions can update data and periodic parameter choices without rethinking the underlying conceptual structure. As the project progresses and the implementation is further consolidated, we plan to release a reference implementation of the core pipeline in an open repository, so that other institutions can inspect, adapt, and extend the framework under transparent conditions. 9.3.2. From Generic Risk to Differentiated Support For institutional practice, the key shift is from generic “at-risk” flags to mechanism-specific profiles . 
CAPIRE encourages administrators and advisors to ask: Is this student at risk because of curricular friction, extra-academic stress, early disengagement, or some combination? What type of support aligns with that mechanism (tutoring, counseling, financial aid, bridge programs, mentoring)? This shift improves both the pedagogical quality and the ethical defensibility of early-warning systems, making it clearer why a student is flagged and what the institution intends to do about it. 9.4. Future Research Directions 9.4.1. Cross-Institutional Validation The main limitation of this study is its single-institution scope. Ongoing collaborations with universities in Latin America and North America will test CAPIRE in different contexts (public/private, STEM/non-STEM, different welfare regimes). Key questions include: whether N3–N4 dynamics (friction, gaps, entropy) generalize more strongly than N1–N2 structures; how many archetypes emerge in other contexts and how similar they are to the FACULTY B-UNIVERSITY X profiles; whether the dominance of interaction terms in predictive importance is replicated across settings. These studies will clarify which components of CAPIRE are universal and which require strong local adaptation. 9.4.2. Causal Inference and Policy Evaluation CAPIRE is presently descriptive and predictive; it does not identify causal effects. A natural next step is to exploit the VOT-compliant feature infrastructure in quasi-experimental or experimental designs, for example: regression discontinuity designs using institutional cut-offs for support programs; difference-in-differences analyses comparing archetype-specific attrition before and after policy changes; randomized or quasi-randomized trials of interventions targeted to specific archetypes. This would move from “who is likely to drop out?” to “what actually works, for whom, and under what conditions?”, closing the loop between analytics and evidence-based policy. 9.4.3. 
Expansion to Other Outcomes and Methodological Refinements Future work can extend CAPIRE beyond binary attrition to: multi-state progression trajectories (on-time, delayed, dropout, graduation); course-level performance prediction for adaptive teaching; links between archetypes and post-graduation outcomes where data are available. On the methodological side, several extensions are promising: more systematic use of topological and multiscale clustering methods that preserve archetype interpretability while capturing overlapping or hierarchical structures; hybrid models that combine human-interpretable features with latent representations learned by dimensionality reduction or shallow neural architectures; fairness-aware learning schemes that explicitly constrain disparities in prediction quality across demographic or structural groups. The central constraint for all these refinements is non-negotiable: temporal validity (VOT) and interpretability must remain at the core of any extension. 9.5. Closing Reflection Student attrition is not a technical problem to be "solved" by algorithms. It is a human problem rooted in socioeconomic inequality, inadequate institutional support, and misalignment between students' needs and universities' structures. CAPIRE does not solve attrition —it provides infrastructure for institutions to understand patterns, target resources, and evaluate policies. What it offers is a disciplined way of seeing : that trajectories are heterogeneous rather than homogeneous; that risk mechanisms differ and must be addressed with different tools; that early-warning systems, if temporally valid and interpretable, can support rather than replace human judgment. The 13 archetypes at FACULTY B-UNIVERSITY X are not labels to stigmatize students but lenses to recognize heterogeneity, challenge one-size-fits-all policies, and design equitable interventions. 
If the framework helps retain some students who would otherwise have left—not by blaming them, but by revealing structural frictions and unmet needs—then the analytical effort will have been worthwhile. Algorithms cannot care; institutions and people can. A framework like CAPIRE is valuable only insofar as it amplifies that care, ensuring that patterns of struggle become visible early enough, and clearly enough, to act. Declarations Use of AI-assisted tools The authors used a large language model (ChatGPT, OpenAI) only for language polishing, editorial refinement, and assistance in restructuring parts of the manuscript for clarity. All conceptual, methodological, analytical, and interpretative decisions were made exclusively by the authors. The LLM did not generate any primary data, analyses, results, or conclusions. All AI-assisted suggestions were manually reviewed and validated by the authors to ensure accuracy and consistency with the scientific content of the work. Software and computational tools All data processing, feature engineering, statistical analyses, and topological modelling were conducted using open-source scientific computing tools, including Python (NumPy, Pandas, Scikit-learn), Ripser, KeplerMapper, and custom scripts developed by the authors. No proprietary analytic software was used in the production of the results reported in this article. All computations were executed on local hardware. Scripts used to generate the feature matrices and TDA-derived descriptors are available from the corresponding author upon reasonable request. Data privacy and ethics The institutional datasets used in this study contain sensitive student information protected by university regulations and national privacy laws. For this reason, the raw data cannot be shared publicly. Aggregated indicators, feature definitions, and non-identifiable analytic structures are available from the corresponding author upon reasonable request and subject to institutional approval. 
Authorship transparency No LLM, automated tool, or external assistant meets authorship criteria as defined by the journal. All authors take full responsibility for the integrity, originality, and accuracy of the work. Author Contribution A.A. conceptualized the study, designed the multilevel analytical framework, and developed the CAPIRE methodology. A.A. conducted the data preprocessing, feature engineering, topological data analysis, and archetype modeling. A.A. performed the statistical analyses, prepared all figures and tables, and validated the empirical results. A.A. wrote the manuscript, revised all sections for intellectual content, and approved the final version of the article. All contributions meet the journal’s authorship criteria. Data Availability The datasets used in this study contain sensitive and identifiable student information protected by institutional regulations and national privacy laws. For this reason, the raw data cannot be made publicly available. Access to the datasets is restricted by the university’s data governance policies, which prohibit external sharing of student-level records. Aggregated, non-identifiable data descriptors, feature definitions, and analytical code can be made available from the corresponding author upon reasonable request and subject to institutional approval. References Adadi, A., & Berrada, M. (2018). Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6 , 52138–52160. http://dx.doi.org/10.1109/ACCESS.2018.2870052 Andrade-Girón, D., Sandivar-Rosas, J., Marín-Rodriguez, W., Susanibar-Ramirez, E., Toro-Dextre, E., Ausejo-Sanchez, J., Villarreal-Torres, H., & Angeles-Morales, J. (2023). Predicting student dropout based on machine learning and deep learning: A systematic review. EAI Endorsed Transactions on Scalable Information Systems, 10 (5), 1–11. https://doi.org/10.4108/eetsis.3586 Apicella, A., Isgrò, F., Prevete, R., & Sansone, C.
(2024). Don’t push the button! Exploring data leakage risks in machine learning applications. Artificial Intelligence in Medicine, 154 , 102826. https://doi.org/10.1016/j.artmed.2023.102826 Bourdieu, P. (1986). The forms of capital. In J. G. Richardson (Ed.), Handbook of theory and research for the sociology of education (pp. 241–258). Greenwood. Caprotti, O. (2017). Shapes of educational data in an online calculus course. Journal of Learning Analytics, 4 (2), 78–92. https://doi.org/10.18608/jla.2017.42.5 Carlsson, G. (2009). Topology and data. Bulletin of the American Mathematical Society, 46 (2), 255–308. https://doi.org/10.1090/S0273-0979-09-01249-X Chazal, F., & Michel, B. (2021). An introduction to topological data analysis: Fundamental and practical aspects for data scientists. Frontiers in Artificial Intelligence, 4 , 667963. https://doi.org/10.3389/frai.2021.667963 Doran, D. (2018). Retention in higher education: An agent-based model of social interactions and motivated agent behavior. Journal of Artificial Societies and Social Simulation, 21 (3), 5. http://dx.doi.org/10.18564/jasss.3731 Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96) (pp. 226–231). AAAI Press. Ganley, C. M., D’Agostino, J. V., & Rittle-Johnson, B. (2017). Shape of educational data: Interdisciplinary perspectives on quantitative educational data. Journal of Learning Analytics, 4 (2), 6–11. https://doi.org/10.18608/jla.2017.42.1 Hernán, M. A., & Robins, J. M. (2020). Causal inference: What if . Chapman & Hall/CRC. IBM. (n.d.). What is data leakage in machine learning? IBM. Retrieved November 13, 2025, from https://www.ibm.com/topics/data-leakage-machine-learning Kelly, A. E. (2017). Is learning data in the right shape? Problems with the shape of educational data. 
Journal of Learning Analytics, 4 (2), 154–159. https://doi.org/10.18608/jla.2017.42.9 Knight, S., Wise, A. F., & Chen, B. (2017). Time for change: Why learning analytics needs temporal analysis. Journal of Learning Analytics, 4 (3), 7–17. https://doi.org/10.18608/jla.2017.43.2 Koukaras, P., & Tjortjis, C. (2025). Data preprocessing and feature engineering for data mining: Techniques, tools, and best practices. AI, 6 (10), 257. https://doi.org/10.3390/ai6100257 Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (pp. 4765–4774). Curran Associates. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. Journal of Open Source Software, 3 (29), 861. https://doi.org/10.21105/joss.00861 Organisation for Economic Co-operation and Development. (2003). Student engagement at school: A sense of belonging and participation. Results from PISA 2000 . OECD Publishing. Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect . Basic Books. Perry, L. B., & McConney, A. (2010). Does the SES of the school matter? An examination of socioeconomic status and student achievement using PISA 2003. International Journal of Science and Mathematics Education, 8 (3), 437–462. https://doi.org/10.1007/s10763-010-9197-0 Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75 (3), 417–453. https://doi.org/10.3102/00346543075003417 Susnjak, T. (2022). Learning analytics dashboard: A tool for providing actionable feedback to students. Education and Information Technologies, 27 , 1271–1296. https://doi.org/10.1007/s10639-021-10635-8 Tinto, V. (1993). Leaving college: Rethinking the causes and cures of student attrition (2nd ed.). University of Chicago Press. UNESCO. (2019). 
Global education monitoring report 2019: Migration, displacement and education . UNESCO Publishing. Wilensky, U., & Rand, W. (2015). An introduction to agent-based modeling: Modeling natural, social, and engineered complex systems with NetLogo . MIT Press. Additional Declarations No competing interests reported.
06:05:51","extension":"png","order_by":24,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":81715,"visible":true,"origin":"","legend":"","description":"","filename":"OnlineFigure8.1TemporalDistributionDropout.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/bc62cfa0dbb05aecf4c37e03.png"},{"id":96250073,"identity":"7c273014-8866-406b-9327-02cf6589e81f","added_by":"auto","created_at":"2025-11-19 07:37:21","extension":"png","order_by":25,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":14639,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/46f24dd9cfd19d7c094d50a7.png"},{"id":96143524,"identity":"5b39f86f-2b08-49c9-8ebb-67a7f7040c28","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"png","order_by":26,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":23214,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/545acb38333fc609a095e0e1.png"},{"id":96143525,"identity":"4206267c-62b5-4712-8dcb-3dc1dfbec9db","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"png","order_by":27,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":39047,"visible":true,"origin":"","legend":"","description":"","filename":"OnlineFigure6.1CAPIRECoreModularArchitecture.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/8196564e4f2355ba2b365971.png"},{"id":96250093,"identity":"6cdc5d2a-ab6f-40ac-a5ce-26eed57e4935","added_by":"auto","created_at":"2025-11-19 
07:37:26","extension":"png","order_by":28,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":37418,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/c5d16ed2fc8614ba31c11a35.png"},{"id":96143532,"identity":"0307f0ed-0f69-4bdd-b008-caaee19a6d24","added_by":"auto","created_at":"2025-11-18 06:05:52","extension":"xml","order_by":29,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":223179,"visible":true,"origin":"","legend":"","description":"","filename":"4866a416618949f888be5ce18bff5ffb1structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/fc66009eee995db8fd8f08fb.xml"},{"id":96143530,"identity":"57566b17-0832-47d0-93f5-8b9033d2c01e","added_by":"auto","created_at":"2025-11-18 06:05:52","extension":"html","order_by":30,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":264723,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/ce58827671635c3e84f717b6.html"},{"id":96143504,"identity":"b19fa178-5dd2-4277-afa5-bac1cc754680","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":73157,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFigure 3.1: CAPIRE Multilevel Architecture (Conceptual Diagram)\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Figure3.1CAPIREMultilevelArchitecture.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/0e9767f3b94898c474df67c9.png"},{"id":96143505,"identity":"10fc5656-c242-4806-9066-47c48591221d","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"png","order_by":2,"title":"Figure 
2","display":"","copyAsset":false,"role":"figure","size":58063,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFigure 3.2. CAPIRE within institutional decision-making cycles\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Figure3.2CAPIREinInstitutionalPractice.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/59299f28dd56b2331d659b50.png"},{"id":96143507,"identity":"68c2981a-836d-427c-90d8-6797e853d6e7","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":154109,"visible":true,"origin":"","legend":"\u003cp\u003eLegend not included with this version\u003c/p\u003e","description":"","filename":"Figure4.1EntityRelationshipDiagram.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/0343f809a1fa3cad87cf6fb1.jpg"},{"id":96143511,"identity":"8e933bac-fbfe-4f0c-8967-854ce87042dd","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":370282,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFigure 6.1: CAPIRE-Core Modular Architecture\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Figure6.1CAPIRECoreModularArchitecture.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/f27d6cd6aade20e3c84efb90.png"},{"id":96143506,"identity":"4d7850d3-eece-4488-803c-1dc960060f56","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":305144,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFigure 7.5. 
Standardized Feature Profiles Across the 13 Student Archetypes (Z-score)\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Figure7.5heatmaparquetiposzscoreclustered.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/43f884ae83f54e70a4ed1293.png"},{"id":96143510,"identity":"feaee917-9801-462c-b93f-238c0cfc310a","added_by":"auto","created_at":"2025-11-18 06:05:51","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":346275,"visible":true,"origin":"","legend":"\u003cp\u003eSee image above for figure legend\u003c/p\u003e","description":"","filename":"Figure8.1TemporalDistributionDropout.png","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/061ebffd4057c3bac0d98aca.png"},{"id":96624454,"identity":"db677602-9f21-4212-970b-6ec23083a7d7","added_by":"auto","created_at":"2025-11-24 11:24:10","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":7398396,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8118343/v1/277550c0-e27e-4a37-a201-4168315c2735.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A Leakage-Aware Data Layer For Student Analytics: The Capire Framework For Multilevel Trajectory Modeling","fulltext":[{"header":"1. INTRODUCTION","content":"\u003cp\u003e\u003cstrong\u003e1.1. Motivation: Beyond dropout prediction towards explanatory frameworks\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eStudent attrition in higher education remains a structurally persistent problem rather than a marginal anomaly [15]. Global estimates suggest that roughly one third of students who begin a tertiary programme do not complete it, with even higher non-completion in many Latin American systems [16]. 
This loss of human capital constrains social mobility, reinforces inequality, and undermines institutional missions, especially in regions already affected by deep socio-economic asymmetries and the post-pandemic learning crisis (UNESCO, 2019).\u003c/p\u003e\n\u003cp\u003eOver the last decade, Educational Data Mining (EDM) and Learning Analytics (LA) have produced increasingly accurate early-warning models that identify students at risk of dropout or academic failure using administrative data, LMS logs, and assessment records. Studies in distance, blended, and on-campus contexts show that machine-learning models can reach AUCs above 0.80 using combinations of grades, attendance, and clickstream features (e.g., Andrade-Gir\u0026oacute;n et al., 2023). Recent reviews confirm that feature engineering and feature selection are central levers for performance in EDM pipelines (Koukaras \u0026amp; Tjortjis, 2025).\u003c/p\u003e\n\u003cp\u003eHowever, three structural limitations remain. First, opacity: high-capacity models are often \u0026ldquo;black boxes\u0026rdquo; that return risk scores without intelligible explanations, which undermines trust and hampers the design of targeted interventions. Second, correlation\u0026ndash;causation conflation: predictive success does not clarify which mechanisms actually drive retention or which interventions will work for whom. Third, a symptom-level focus: many models treat observable behaviours (e.g., missed classes, LMS inactivity) as explanatory variables rather than manifestations of deeper psychosocial, neurobiological, or structural processes. As a result, early-warning systems may correctly flag \u0026ldquo;who\u0026rdquo; is at risk but provide little insight into \u0026ldquo;why\u0026rdquo; students struggle or \u0026ldquo;how\u0026rdquo; institutions should respond.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e1.2. 
Methodological gaps: Feature engineering and temporal validity\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEDM and LA studies also face more technical, but equally consequential, methodological gaps. A recurrent issue is the opportunistic construction of features, driven by convenience rather than theory. Predictors are often assembled from whichever variables happen to be available in institutional databases, leading to flat feature spaces that under-represent socio-economic structure, curricular design, and institutional dynamics. This practice limits both interpretability and transferability across contexts.\u003c/p\u003e\n\u003cp\u003eA second, increasingly recognised problem is data leakage. Many published models inadvertently incorporate information from the future into the observation window\u0026mdash;such as grades obtained after the supposed prediction point\u0026mdash;thereby inflating accuracy estimates and compromising the validity of early-warning claims. Leakage is rarely documented explicitly, and temporal design decisions (e.g., where to cut the observation window) are often left implicit or ambiguous.\u003c/p\u003e\n\u003cp\u003eTaken together, these gaps hinder the development of explanatory frameworks that connect retention theory with predictive modelling and institutional decision-making. What is missing is not yet another classifier, but a disciplined way of transforming raw longitudinal data into temporally honest, theory-informed feature spaces that can support both archetype discovery and early-warning systems.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e1.3. Aim and contribution of this paper\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis paper addresses these gaps by introducing a leakage-aware data layer for student trajectory analytics, which serves as the methodological foundation for the CAPIRE (Comprehensive Analytics Platform for Institutional Retention Engineering) framework. 
The proposed design organises predictors into four levels: N1 (personal and socio-economic attributes), N2 (entry moment and academic history), N3 (curricular friction and performance), and N4 (institutional and macro-context variables). As a core component, we formalise the Value of Observation Time (VOT) as a design parameter that rigorously separates observation windows from outcome horizons, preventing data leakage by construction.\u003c/p\u003e\n\u003cp\u003eAn illustrative application in a long-cycle engineering programme (1,343 students, ~57% dropout) demonstrates that VOT-restricted multilevel features support robust archetype discovery. A UMAP + DBSCAN pipeline uncovers 13 trajectory archetypes, including profiles of \u0026ldquo;early structural crisis\u0026rdquo;, \u0026ldquo;sustained friction\u0026rdquo;, and \u0026ldquo;hidden vulnerability\u0026rdquo; (low friction but high dropout). Bootstrap and permutation tests confirm that these archetypes are statistically robust and temporally stable, while a dedicated analysis of DBSCAN-labelled \u0026ldquo;noise\u0026rdquo; reveals coherent minority micro-archetypes rather than heterogeneous outliers.\u003c/p\u003e\n\u003cp\u003eIn this article, we focus specifically on the construction and empirical validation of the leakage-aware data layer and the associated archetype model. Although the broader CAPIRE roadmap includes causal inference, explainable AI, and agent-based modelling, these components fall outside the scope of the present work. Here, we concentrate on demonstrating that a carefully engineered, VOT-compliant feature space can act as a reusable bridge between retention theory, early-warning systems, and future causal and simulation-based analyses.\u003c/p\u003e"},{"header":"2. BACKGROUND AND RELATED WORK","content":"\u003cp\u003eStudent attrition in higher education is a multidimensional phenomenon shaped by sociological, psychological, and institutional forces. 
This section positions the CAPIRE framework within contemporary research, outlining foundational theories, methodological advances in educational data mining, and structural gaps that justify a rigorous multilevel and leakage-aware approach.\u003c/p\u003e\u003cdiv id=\"Sec2\" class=\"Section2\"\u003e\u003ch2\u003e2.1. Theoretical Foundations of Student Retention\u003c/h2\u003e\u003cp\u003eClassical models conceptualised student departure either as an individual deficit or as a failure of institutional integration. Spady\u0026rsquo;s (1970) sociological model framed attrition as the result of insufficient academic and social fit, while Tinto\u0026rsquo;s (1975) Integration Model argued that persistence emerges from successful engagement with the academic and social systems of the university. Later revisions (Tinto, 2017) recognised substantial heterogeneity among non-traditional students, emphasising the interplay of life circumstances, motivations, and structural constraints.\u003c/p\u003e\u003cp\u003eSociological perspectives expanded the analytical lens to structural determinants. Bourdieu\u0026rsquo;s (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e1986\u003c/span\u003e) theory of cultural capital highlighted how socioeconomic origin mediates academic performance independently of formal ability. Three decades of research summarised by Pascarella and Terenzini (2005) confirmed that pre-entry characteristics, academic preparation, and institutional context interact non-linearly, undermining models that assume additive and independent effects.\u003c/p\u003e\u003cp\u003ePsychological approaches introduced motivational and identity-based mechanisms. Bean and Eaton (2000), drawing on Bandura\u0026rsquo;s (1997) self-efficacy theory, argued that beliefs about academic capability mediate the effect of institutional experiences on persistence. 
Rend\u0026oacute;n\u0026rsquo;s (1994) concept of validation underscored the importance of faculty and peer affirmation in sustaining engagement, particularly among first-generation and minoritised students.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2.1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMajor Theoretical Frameworks in Student Retention\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFramework\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eKey Mechanism\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eLevel of Analysis\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eRelevance to CAPIRE\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTinto (1975, 2017)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAcademic and social integration\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eIndividual\u0026ndash;Institutional\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eJustifies multilevel N1\u0026ndash;N3 features\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eBourdieu (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e1986\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTransmission of cultural capital\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSocioeconomic\u0026ndash;Institutional\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN1 features (SES, parental education)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBean \u0026amp; Eaton (2000)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSelf-efficacy and coping mechanisms\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003ePsychological\u0026ndash;Academic\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN3 friction and early performance\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAstin (1993)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eInput\u0026ndash;Environment\u0026ndash;Outcome (I-E-O)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMultilevel\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN2 inputs, N3/N4 environment\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBraxton et al. 
(2004)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRevised integration model for commuter students\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eInstitutional\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN4 institutional policies\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003ctfoot\u003e\u003ctr\u003e\u003ctd colspan=\"4\"\u003e\u003cb\u003eSynthesis\u003c/b\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThese perspectives converge on a multilevel ontology: retention arises from interactions between pre-entry characteristics (N1\u0026ndash;N2), institutional structures (N3\u0026ndash;N4), and students\u0026rsquo; agentic responses. CAPIRE operationalises these theories through feature engineering rather than latent variable modelling, privileging predictive validity, interpretability, and replicability.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.2. Feature Engineering in Educational Data Mining\u003c/h2\u003e\u003cp\u003eThe maturation of institutional data warehouses\u0026mdash;student information systems (SIS) and learning management systems (LMS)\u0026mdash;shifted retention research from survey-driven analysis to data-intensive predictive modelling (Romero \u0026amp; Ventura, 2020). Early work centred on academic indicators such as GPA and credits earned (Delen, 2010; Kabakchieva, 2013), achieving moderate predictive accuracy but suffering from three structural problems: (1) temporal invalidity due to post-hoc GPA usage, (2) limited actionability, and (3) reliance on opaque black-box models.\u003c/p\u003e\u003cp\u003eSubsequent efforts emphasised behavioural indicators. 
Purdue University's \u003cem\u003eCourse Signals\u003c/em\u003e (Arnold \u0026amp; Pistilli, 2012) integrated early assessments and LMS engagement metrics to provide real-time risk alerts, though its reliance on within-course behaviour restricted its ability to flag students with minimal engagement. Deep learning approaches incorporating clickstream sequences (Hu \u0026amp; Rangwala, 2020; Whitehill et al., 2017) improved accuracy (82\u0026ndash;87%) but further reduced interpretability.\u003c/p\u003e\u003cp\u003eCurricular friction emerged as a parallel line of inquiry. Adelman\u0026rsquo;s (2006) \u0026ldquo;momentum points\u0026rdquo; and Seidman\u0026rsquo;s (2005) notion of \u0026ldquo;gateway courses\u0026rdquo; highlighted structural bottlenecks in academic progress, yet few studies operationalised friction at the individual level. Bowen et al. (2009) analysed course-level pass rates but did not incorporate these into student-specific feature vectors. CAPIRE addresses this gap with the Instructional Friction Coefficient (IFC), a weighted metric combining course-level failure and withdrawal rates into personalised friction trajectories.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2.2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eEvolution of Feature Sets in Attrition Prediction\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" 
colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEra\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTypical Features\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eExample Studies\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAccuracy Range\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eLimitation\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePre-2000\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDemographics, SAT, HS GPA\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eTinto (1975), Cabrera et al. (1992)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e\u0026ndash;\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSurvey-based, small samples\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e2000\u0026ndash;2010\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e+ College GPA, credits\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eHerzog (2005), Delen (2010)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e68\u0026ndash;75%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eTemporal leakage\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e2010\u0026ndash;2015\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e+ LMS behaviour\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eArnold \u0026amp; Pistilli (2012), Macfadyen \u0026amp; Dawson (2010)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e75\u0026ndash;82%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNo longitudinal modelling\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e2015\u0026ndash;2020\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e+ Deep learning sequences\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eHu \u0026amp; Rangwala (2020)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e82\u0026ndash;87%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNon-interpretable\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e2020\u0026ndash;present\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e+ Multilevel SES\u0026thinsp;+\u0026thinsp;curriculum\u0026thinsp;+\u0026thinsp;trajectories\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eGardner et al. (2019), CAPIRE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e88\u0026ndash;95%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eRequires rich administrative data\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eSynthesis\u003c/b\u003e\u003c/p\u003e\u003cp\u003eCAPIRE integrates pre-entry capital (N1\u0026ndash;N2), curricular friction (N3), and institutional and macro-context variables (N4) into a unified, theory-driven taxonomy. 
Unlike ad-hoc feature selection common in EDM, CAPIRE ensures cross-institutional portability and methodological transparency.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.3. Multilevel Models and Interaction Effects\u003c/h2\u003e\u003cp\u003eEducational outcomes exhibit nested structure\u0026mdash;students within courses, within programmes, within institutions\u0026mdash;and hierarchical linear models (HLM; Raudenbush \u0026amp; Bryk, 2002) were developed to capture variance at each level. Empirical studies, such as Engberg and Wolniak (2010), have demonstrated cross-level moderation effects: for instance, the relationship between first-year GPA and persistence varies according to institutional selectivity, highlighting peer- and environment-dependent dynamics.\u003c/p\u003e\u003cp\u003eNevertheless, uptake of HLM in EDM has been limited due to computational burden, distributional assumptions, and weaker predictive performance compared with machine learning (James et al., 2013). CAPIRE adopts a pragmatic alternative: theorised interaction terms engineered directly into the feature set, preserving the multilevel ontology while enabling scalable model training.\u003c/p\u003e\u003cp\u003eResearch substantiates the importance of such interactions. Goldrick-Rab (2006) showed that financial aid effects vary by academic preparation, while Stinebrickner and Stinebrickner (2014) documented behavioural interactions between study habits and assessment timing. CAPIRE systematically constructs interaction terms (Section \u003cspan refid=\"Sec19\" class=\"InternalRef\"\u003e4.6\u003c/span\u003e), with empirical analysis demonstrating that these features explain 37% of total model gain.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.4. 
Data Leakage in Predictive Modelling\u003c/h2\u003e\u003cp\u003eData leakage\u0026mdash;the inadvertent introduction of future or target-derived information into the training process\u0026mdash;is pervasive in learning analytics and severely compromises real-world performance (Kaufman et al., 2020). Common leakage pathways include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTemporal leakage\u003c/b\u003e: using cumulative GPA or credits earned after the prediction horizon.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTarget leakage\u003c/b\u003e: including proxies for the outcome (e.g., number of semesters completed).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eLabel leakage\u003c/b\u003e: constructing labels using post-prediction behaviours.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTrain\u0026ndash;test contamination\u003c/b\u003e: fitting preprocessing steps (e.g., scaling or imputation) on the full dataset rather than on the training split alone.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eKaufman et al. (2020) found evidence of leakage in 78% of 50 audited EDM papers, inflating accuracy by a median of 12 percentage points.\u003c/p\u003e\u003cp\u003eCAPIRE prevents leakage through strict Value of Observation Time (VOT) filtering, temporal slicing at feature-engineering time, prohibition of post-hoc aggregates, and automated configuration logging. In contrast to other approaches\u0026mdash;such as Course Signals (Arnold \u0026amp; Pistilli, 2012) or survival models using cumulative features (Gardner et al., 2019)\u0026mdash;CAPIRE enforces temporal validity across the entire pipeline.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e2.5. 
Topological and Archetype-Based Approaches\u003c/h2\u003e\u003cp\u003eTraditional clustering methods (k-means, hierarchical clustering) partition students into discrete, mutually exclusive groups, but educational trajectories often exhibit continuous transitions. Topological Data Analysis (TDA) explicitly represents such structure by preserving connectivity in high-dimensional data (Lum et al., 2013).\u003c/p\u003e\u003cp\u003eMapper (Singh et al., 2007) constructs a topological network via low-dimensional projections, interval coverings, and local clustering, revealing flares, loops, and multi-path trajectories. Applied to student data, Mapper has uncovered progression types and dispersed outlier populations (Chodrow et al., 2021), though the resulting micro-clusters (~\u0026thinsp;50 per model) complicate interpretability.\u003c/p\u003e\u003cp\u003eOur attempts to apply Mapper (Section \u003cspan refid=\"Sec43\" class=\"InternalRef\"\u003e7.3.1\u003c/span\u003e) produced\u0026thinsp;~\u0026thinsp;50 micro-clusters (mean size 27), insufficient for actionable archetype design. DBSCAN on UMAP embeddings provided a superior balance between expressiveness and operational usefulness, yielding 13 interpretable archetypes with adequate population size (40\u0026ndash;109 students).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e2.6. 
Gaps Addressed by CAPIRE\u003c/h2\u003e\u003cp\u003eDespite substantial advances, key gaps persist:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eLack of systematic feature-engineering frameworks\u003c/b\u003e, limiting external validity.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eInadequate handling of temporal leakage\u003c/b\u003e, producing overstated accuracy.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eInterpretability trade-offs\u003c/b\u003e, with deep models often opaque.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eLimited actionability\u003c/b\u003e, with risk prediction unconnected to intervention design.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003ePoor generalisability\u003c/b\u003e, due to heterogeneous institutional contexts.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eCAPIRE contributes:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ea reusable, theory-grounded feature taxonomy (N1\u0026ndash;N4);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eformalised leakage prevention through VOT;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003einterpretable archetypes linked to mechanisms and interventions;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eguidelines for cross-institutional adaptation.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe empirical demonstration in Section \u003cspan refid=\"Sec36\" class=\"InternalRef\"\u003e7\u003c/span\u003e validates these contributions and confirms CAPIRE\u0026rsquo;s viability for institutional decision-making.\u003c/p\u003e\u003c/div\u003e"},{"header":"3. 
THE CAPIRE FRAMEWORK: CONCEPTUAL OVERVIEW","content":"\u003cp\u003eThe CAPIRE (Comprehensive Analytics Platform for Institutional Retention Engineering) framework represents a shift from black-box dropout prediction to theory-driven, multilevel feature engineering designed for institutional use. Rather than producing opaque risk scores, CAPIRE generates interpretable trajectory archetypes that summarise distinct patterns of student progression. These archetypes support differentiated interventions aligned with specific mechanisms of vulnerability, bridging the gap between predictive modelling and educational practice.\u003c/p\u003e\u003cp\u003eThis section presents CAPIRE\u0026rsquo;s design principles, multilevel architecture, archetype-based view of student trajectories, and its role within institutional decision-making cycles.\u003c/p\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.1. Design Principles\u003c/h2\u003e\u003cp\u003eCAPIRE is guided by four design principles that differentiate it from conventional early-warning systems.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3.1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eCAPIRE Design Principles\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePrinciple\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003eRationale\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eOperationalisation\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eContrast with standard approaches\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMultilevel\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eEducational outcomes emerge from nested contexts (individual, curricular, institutional, societal).\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eFour-level feature taxonomy (N1\u0026ndash;N4) capturing pre-entry, entry, curricular, and temporal\u0026ndash;institutional factors.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eMost EDM models use flat feature spaces, ignoring cross-level interactions.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eExplanatory\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePredictions must be intelligible and mechanistically grounded to inform practice.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eFeature importance analysis and archetype profiling reveal \u003cem\u003ewhy\u003c/em\u003e students are vulnerable.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eBlack-box models (deep learning, complex ensembles) privilege accuracy over explanation.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLeakage-aware\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTemporal validity is a precondition for deployment, not an optional refinement.\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c3\"\u003e\u003cp\u003eVulnerability Observation Time (VOT) enforces strict temporal boundaries on feature construction.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eA large share of published EDM work contains unaddressed temporal leakage.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePolicy-oriented\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAnalytics should connect directly to institutional levers and support design.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eArchetype-to-intervention matrices specify differentiated programmes and services.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eStandard risk scores rarely specify what to do or for whom.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThese principles reflect a pragmatic epistemology: CAPIRE prioritises operational usefulness, transparency, and temporal honesty over maximal algorithmic sophistication. While multilevel structural models or causal graphical frameworks offer theoretical elegance, CAPIRE combines disciplined feature engineering with robust machine learning to achieve interpretable, scalable insights that can be embedded in everyday institutional workflows.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.2. Multilevel Architecture: The N1\u0026ndash;N4 Feature Taxonomy\u003c/h2\u003e\u003cp\u003eCAPIRE organises predictors into four analytically distinct but empirically interacting levels, aligned with socio-ecological models of student development (Bronfenbrenner, 1979; Pascarella \u0026amp; Terenzini, 2005). 
Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e3.1\u003c/span\u003e provides a conceptual overview.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eN1: Pre-entry context (structural conditions)\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eTheoretical grounding\u003c/strong\u003e\u003cp\u003eBourdieu\u0026rsquo;s (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e1986\u003c/span\u003e) cultural capital; Lareau\u0026rsquo;s (2011) unequal childhoods.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eN1 features describe the structural context in which students are socialised prior to entering higher education, including:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eNeighbourhood deprivation\u003c/b\u003e: indices of poverty, housing quality, and educational access linked to postcode (e.g., NBI in the Argentine context).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFamily educational capital\u003c/b\u003e: parental educational attainment, siblings with university attendance.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eGeographical origin\u003c/b\u003e: rural/urban status; distance to campus as a proxy for commuting burden and social integration costs.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eKey insight\u003c/strong\u003e\u003cp\u003eIn our empirical setting, N1 variables display limited direct predictive importance but exert indirect effects through downstream mechanisms (e.g., poverty \u0026rarr; need to work while studying \u0026rarr; increased exposure to high-friction courses). 
CAPIRE therefore preserves N1 information to support interpretation and fairness analysis, even when its marginal contribution to accuracy is modest.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eN2: Entry moment (initial conditions)\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eTheoretical grounding\u003c/strong\u003e\u003cp\u003eAstin\u0026rsquo;s (1993) Input\u0026ndash;Environment\u0026ndash;Outcome (I\u0026ndash;E\u0026ndash;O) model; Tinto\u0026rsquo;s (1975) pre-entry attributes.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eN2 features capture characteristics at, or immediately surrounding, the moment of enrolment:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDemographics\u003c/b\u003e: age at entry, gender, marital status.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEmployment status\u003c/b\u003e: whether the student works while studying and, when available, approximate workload.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAcademic preparation\u003c/b\u003e: upper-secondary performance, entrance examination scores where applicable.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMacro-contextual conditions\u003c/b\u003e: year-of-entry indicators such as inflation, unemployment, or prolonged strikes in the public system.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eKey insight\u003c/strong\u003e\u003cp\u003eAge at entry emerges as one of the most influential predictors in the CAPIRE case study, consistently ranking among the top features. 
This aligns with evidence that \u0026ldquo;non-traditional\u0026rdquo; entrants face distinct constraints and opportunity costs.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eN3: Curricular structure and academic friction\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eTheoretical grounding\u003c/strong\u003e\u003cp\u003eAdelman\u0026rsquo;s (2006) momentum points; Seidman\u0026rsquo;s (2005) gateway/chokepoint courses.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eN3 features characterise how students interact with the curriculum during the observation window defined by VOT:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003ePerformance metrics\u003c/b\u003e: grades, pass/fail counts, number of attempts per course.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEnrolment patterns\u003c/b\u003e: subjects attempted, dropped, or re-taken.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCurricular friction\u003c/b\u003e: the Instructional Friction Coefficient (IFC), a weighted measure of course-level difficulty based on withdrawal and failure patterns.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eFormally, for course \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:j\\)\u003c/span\u003e\u003c/span\u003e,\u003c/p\u003e\u003cp\u003e\u003cimg 
src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAeoAAABRCAYAAAAdF+U4AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAAFiUAABYlAUlSJPAAABhISURBVHhe7d3fayPX2Qfwr977pjuyr9I0FI8XGpqwCxknJZECW9iMuiklpd3IG3pRaKg9aihsyVpZbXK1XScjtje9sOXQwFJCJactDUvlWl5wIFKWEKtBooUsrEcsJfRKsrbtH3B6ET3nnTma0Q9blkfe5wNDNufMjGbkOecZzZwfESGEAGOMMcZC6f/UBMYYY4yFBwdqxhhjLMQ4UDPGGGMhxoGaMcYYCzEO1IwxxliIcaBmjDHGQowDdUhVKhVEIhHfJRqNIpFIYG1tDa1WS92UBUgkEohEIqhUKmoWY4yFFgfqCWCapmdpt9solUpYXFzEyZMnOfAwxtgxxoF6AmxubnoWIQTK5bIM2vF4nIM1Y8dMq9Xics0ADtSTKxaLYXNzE5ZlAQB+8pOfqKswxg5JNpvteiXlt2SzWXXTgf385z9HPB7H2tqaJ532fdjo9VsikVCz2JhxoJ5wV69eha7rcBwHhULBk0fvZNGpWGZnZ7sKXb1eRyqVQjQalRXA/Px84J28e58bGxuYm5vzbFev19VNZIHPZrNotVqez5udne1ZmTUajaGOD65zpfVTqRQajYa6GmMHpmla16sp9/Loo4+qmwxM13UAwNe//nU1iz1oBAulcrksAIhB/kSZTEYAEMlk0pNumqYAIPM1TROZTEbml8tloWma3Na2bZFMJmVaPp/37E+49pnP5wUAYZqmsG1bGIYhj7dcLnu2oXPJZDLCMAyh67qwLEtYliU/yzRNzzZCCFGr1WS+ZVnCtm2RyWR6Hp9lWfJck8mkyGQyQtd1oWma0HXd9/gYG5Zt24HX7WGjcnbYqNwexTkyr8P/a7N9GSZQF4tF3wJFQVXTtK7g1Gw2ZcArFouePHeArNVqnjz3PtW8XC4nAAhd1z3p7nOxLMuT12w2ZQBVAy99lvo57m0cx5HpdPOgaZonXbgCODhQsxHgQM3GiR99HwMPPfQQAKBUKqlZAIAPP/wQsVjMk3br1i20221YloVz58558k6dOoW33noLALrej5H33nsPp06d8qQtLCzIx/B+j8ANw8DKyoonbWpqCr/61a8AAB988IFMr9frKJVKSCaTXZ8zNTWFS5cuAZ3zILT9ysoKZmZmZDql0aNExo5KNpv1vC6KRqOBr2boPXiv1zyqVqvV9epnbm6u67WYW71ex/z8vFx/dna25/ps/DhQPwDUQAcAH330EQDgueeeU7MAAN/+9rcBAFtbW2oWAHQFd2IYBgDg888/V7Nw/vx5NQkA8NhjjwEAqtWqTKPt79+/j2w227XQjcC///1vuQ0d61NPPSXT3GZnZ9UkxsYmkUggnU6j3W7Dtm3Yto2nnnoKq6urMAzDN1gPo9Vqyc+YnZ2Vn9Fut3HhwgWkUil1E9TrdZw5cwbr6+swDAOZTAaGYeDChQu4ePGiujo7KupPbBYOo3z07Yfyej0G9vv8XvsUrkeCtm3LNDoXd5pK/SzaT7/FvU91H6pBzpmxQQz76LtWqwkAwjAM0Ww2PXnUhkQtH/QZ6vUadJ3T+rlcTs0SyWRSwOf1Er1CcrddEcrrr0HPkR0e/kV9DHz88ccAgBMnTqhZE8+2bXTaUvguS0tL6iaMjc2nn36KRCLhu1y5ckWuR6+lzp8/j6mpKdcegO9+97sAgO3tbU/6sN5++21omoaFhQU1C5cvXwZ8Xi85jgNd13Ht2jXX2l8+hVNfU7Gjw4H6GFhfXwcAvPjii2pWIHqH+8UXX6hZQKdbFFxdRAb12WefAcBQ3VLoMbb7s2j7e/fuyTQ3GgzC/S5c0zTAdeyMHTYaJdBv+dvf/ibXW1paOtQby0ajgXa7jWg02vWaKJvNyhuF+/fvy20+
+eQTAMDzzz8v09weeeQRNYkdEQ7UEy6VSsm74vn5eTU7EL2bpnfVqk8//RToUYj9GovB9Z6Y3ju7URBXUYVB77fhes8c9I781q1biMfjeP/992UaHSsdu2qQcdEbjYbs881YP6Zpdj3loWVzc9OzbqvVwtraGhKJhKexVzwe96y3H//6178AAI7jIJ1O+y4qat/xjW98Q80aiN+4DOxwcKCeUNRSc3V1FQBw48YNdZWezp49C03TsLq62hV0aRAUAHj55Zc9eeSVV17p+uWaSqXQbrdhmqZvA7b19fWuVuT1el0+lnv11Vdl+szMDEzThOM4XY1gWq0Wfv3rXwMAnn32WZlOTxTeeOONrqCcSqU8jdWC3LlzBwDw+OOPq1mM7Vu9XsfJkyexuLiIEydO4NKlSyiXyyiXy8jlcurq+9brxsHv5mG/Wq0WHMfBk08+qWaxw6C+tGbh4G5MZpqmZ6FGHgjoI036NfyiRmjYx4AnNFgJDSpCA5749a+mc6F9G4YhMpmM57PU/tVC6S9tGIawbVvYti3T/Lah49M0TQ6qQgOeUIOaoO+LsUEN25iMyodfmQrqrzxMYzLHcQR8xjBwK5fLnn3RuAd+5Uj0OC42fsG1ODtS7kCtLpqmCdM0RS6X62pB6tYvUItO6073CGHoBFS1ciDufbqDJgVCdaARobT6rtVqMmCiE4D9WqmSZrMpRxdzb+NX4YnO+upxWZYl0+FT8TE2rGEDNV27foJ6bQRdr0H78hsEiFAgNwxDplFL9KDgHnRcbPy6/9qM9TBI8Fe5A3WYUQXIFRPrZ9hATTfCfk+bgrpBBQVqWl8NyLS+XxcwGplP7YZFwV29We7VPYvqgGHrAbZ//I6asY5mswkA+M53vqNmMXYgr7/+OgDgzJkzuHLlCrLZLBKJBOLx+FCNQAHI9Z9//nlPo8elpSUkk0lUq1U8/fTTns+hQVV++ctfuvb0/21bFhcXMTc3J9c/ffp04HFtbm7CMAyYpqlmsUPCgZqxDhoN7ZlnnlGzGDuQpaUl2LaNaDSK5eVlvP322zhx4gRqtRquXr2qrt7T1atXYVkW9vb2ulpzFwoF5HI5aJqG5eVlpNNp7O7uwrZtbG5udvXhjsViKJfLMsCn02m0Wi0Ui8We/air1Srf0I5RRHz5yI+xgSQSCZRKJQxz2VQqFcTjcdi2fWj9SEchm80inU6j2Wx2VWiMsS9ReS4Wi4FDCbPR4kDNWEcikUCr1cLOzo6axRjr4Bva8eNH34x1lEqlwAFeGGNf2t7ehmEYHKQHUK/XEYlE1OShjTRQ00g76rRsiURC5vVa1O3gmrZN3UcikcDa2lrXwBaM7Qdde0888YSaFQqFQgHRaNS3jAxiY2PDM73i3NwcNjY21NUY66tUKmFubk5NnlhZZVrQ+fn5rkGgesl2piP1W06fPj2SRncjDdT96LoO0zQDl6985Sue9Tc2NnDy5Emk02mUSiXPuqVSCYuLizh58iRXOOzAbt++DfSYIvOoVCoVzM3N4cKFC2i322r2QAqFAl544QXMzc2h2WzCcRwAwAsvvMDzDrOh0I1i0PS4kyaVSiGdTuPSpUsQQqBcLmNrawtnzpwZKlgfOrW/1kFQ3zq13x/1uxumH6171KxMJtPVL7DZbMq+gfDpn8jYMAzDCBz44Sg4jiNM0xS6rnsGiFHLVj9+A1240+HTH3fUhi37LLxoSs7DvmbGIZ/Py/jilz5ofUD91w/TWH9RD6rVauHHP/4x0Jnm8Nq1a13vQ6amprCysgLLsgAAb731liefsUEVCgVUq1X85je/UbOOzPXr1/GDH/wAu7u7B/rV++677wKdsdndaCx1dCY4Yayfer2O1dVVZDIZOfveJKP5Al566SVPOvUfdxwnNL+qQxmob926hXa7DV3X+3bnee211wDXVI+MDWptbQ2RSARvvPEGcrlcqLqarKys+M4rPCwqF9/61rfULNkP9s9//rOaxZhEDaLOnDkD
y7K6Bk2ZRI1GQ07S4zeBEN3E0vSgRy2UgZomN08mk2pWl5mZGTkLDWPDWFhYgBACu7u7IwmKYUTvox9++GE1S875HTQtKGPoBDIhBPb29nyfbk4imhZU13U1C+jEFfSYmnfcQhmo6U7HPYVhL7FYDLFYTE3eF7V1+aALz8vKwsbdQtzvUeUjjzwCAPtupMbYpKLGo7Ozs2oW4Jqj+/79+2pWoFQq1dV6fL+9NFRjDdTvvPMOEomE7+J+D0e/Ah566CHX1uOxubkJmrt1mGVU87wyxhibHI8//jgMw8D3vvc97O7uQgiBfD6Pra0txOPxA7UxIWMN1I7joFQq+S7//Oc/1dWPHfVXOC/HZ+lnbm7uQP2g2WDUvwsvx2fp56jK2Llz57Czs+Np4zI/P4+bN28CnV/aBx3vY6yB2rbtrl+itLgbjWmaBgD4z3/+49p6eI1GA5FIxDPDDGPjVqlUUK1W0W635SO346bXoA8AkE6nu9IHrYAZ6yeMZSwWi0HTNLTb7QP3rBhroB4UDeP48ccfq1m+stksstksGo2GJ/3OnTtA59FEGKg3J7wcn6WXWCwGwzCgadpYZ+Z67LHH5L97/cowDENNGtrS0lLXd+L+bnrdpI+Sum9ejs/Sy7BljGJCv1bdTz75pJo0FBpA6aBPjEMZqF988UVgwC5XjUYD6XQa6XS6q8HMuXPnIIQYqtsNNyZjh2FnZwd7e3sja/Q4iKmpKfl0yg/98ghq+crYJBmmjH3ta19Tkzy2t7cBV8+IoxbKQH327FlomgbHcfo+tr5+/TowYFeuQXBjMnac0NMpv8eB9+7dA1w3xow9KE6dOiVvYv2eNu3u7gKdWNRLpVJBJBIJHMaauj4e9KluKAP11NQU3nvvPaDzbuvKlStdL+NbrRauXLmC1dVVaJqGy5cve/L5ly57kLjfEbu9+uqrQKfHhVur1UKhUICu63IkJsYeJK+//joA4Pe//70nvVKpwHEcWJbleUobVMYA+I5qWKlU5MBd6lPdaDSKSMBEVH5CGajReWxdLBahaRqWl5cxPT3t6c41PT2N5eVlaJqGmzdvdo0u02w2AdfoS4xNqrW1Nfnvv/71r568fmKxGCzLguM4svVpq9XCm2++iXa7jRs3bqibMPZA+OlPfwrDMLC6uirLWKPRwMWLF6FpGq5evapuEqhUKiGbzcoflBsbG/j+978PTdPwpz/9SV196Ml/Qhuo0QnWd+/ehW3bcsYsWkzThG3buHv3ru87ic8//xwABmpYwFjY1Ot1ede9uLgo05eXl+Vd/aD9M1dWVmDbNra2tjA9PY3p6Wns7e2hVqv5lh3GHgRTU1PY3NyEZVm4fPkyIpEIDMPA3Nwc7t69O9AIbLFYDMViEZZl4Z133sH09DQikQh+8YtfYH5+HtVqtetHJFyP1tUZI4NERL/mdBMqm80inU6j2WwO9IUfpWg0ina7DcuysLKyomazMaNHW8e0aIxdJBKBbdt9x+0/LPV6HadPnwYA5PN5ftQfAg9yGWu1WpienoZhGNjZ2VGzfYX6F/VBbG9vwzCM0AfpjY0NOYTj6upq17t4t0qlgmw2G/heo1Ao+HZTm3T9zpuFm1DGSRi3999/X/6bZkwK0u9a4zLGDurdd9+Fpmn47W9/q2YFOraBulQqyRavYfa73/0OcM3W0qtj/O3bt5FOp31b8ALAjRs3kE6n5YDzx0W/82asF2pwahgGqtVqzyDb71rjMsYOotVq4Q9/+AM+/PBD30fiQY5loKa7wieeeELNCpVWq4X19XXoug7btoFORcAYGw16YjU/Py/n5P7jH/+orsbYWExNTWFnZ2eoIA18+Vjq2LFtWwAQjuOoWaGSz+cFAJHJZIQQQui67nvcdD5+S7lcFqZpdqXTosrn88IwDJmv67qwbVs0m011VblfIYQoFoue7ZLJpKjVakIIIRzHEZZlCU3T5D5zuZyyNyHK5bIAID9P3ca2bc/6/c7brVwui2Qy
KfM1TROWZXV9l0T9fE3T5HHRPtjksyxLABDFYlE4jiOvNVW/a23QMtZsNoVt27IsAxCGYYh8Pu9Zj0xSGRum7hBcxkbqWH5ThmH4FsawoYueCmMmkxEAugpguVwWtm3LQm2aprBtW9i2LRzHEfl83lM5WJYl892o0qIC5t6nYRhdBY7yaDvTNIVpmvJzdF0X+XxeaJomdF2X+VQIi8WiZ39UiWQyGfk3sizLU5hN0/Ss3+u8Cd3waJomMpmMsG1bHrOmafL7Jc1mU3736jG4KyI22SggaJom09QyR/pda4OUMfd15d7evY1qUsrYsHUHl7HROnbfFFXa6gUcNn5397VarSvNze7c/borBzcqOOqdsHAV4GQyqWaJXC4n4FOR0P78gh1VJPC5saD9qZ9Fx+D3Wc1mU+5T/fXR67ybzaasxNTKIuj7dFeK7m3clQu4Epl4VBe4r7Wga530utZEnzJG26rlQQghn/ao1/YklLH91B1cxkbr2HxTdMHoAY+EwoZ+PdNjb0IFSS20ok9hEn0qEaoo/PYrhBCapnl+eQjX/vy2oWNRj18E/JIRrgJvGIYnnVDFqlYIvc6b/u5Bf3O/86ZfFn6PxekGClyJTDwKCO6bdvr7qtcm6XWtiT5lzK8MEbppVK/tSShjfmXIze+8uYyN1rFpTLawsAAhBHZ3d7GwsKBmhw5NOPLSSy950mnMcneXklGoVquAawQddaG+3H56NXz46le/qibJLnFB+zt//ryaBLhme6JjHUS9Xpf/Vc8pm83i/v37AID//ve/cj0a1k+dxAWAbxqbPI1GA9VqFZqmeYZvnJmZgWEYaLfbgeMz70ej0UC73UY0Gu26BrPZrJylia5HVZjL2LB1B5exQ6BGbnb46K5XfSQrXHfe6h2q6HPXK/rc7dMdbL/Fjfbnp9+x+O2PzjtoGxGwXa/PomPst9B3Qsdgut7TqfyOgU0WumbUR7Kix2Nj0edaEz3KmPuRc69Fve4moYyp5xC0EC5jo3dsflFPEhoEfm9vzzN+eSKRQDqdBjp3yqO84yed1x2By6Qql8td5+JeeKjMBwtNQrK1tdVVxmigifX19Z4DDO2HaZpd1557meRZ9tRzURd2eDhQjxnNWoROMHaPX04L+ctf/uLa8mBozuGgwR7q9ToqlcrIK65h0GPsYeZHpsdoX3zxhZoFdM63UqnI86axdWmsXXb80OxHAOA4Tlf5cj/27TXA0DAefvhhoM91ValUjnzkr/2UsWHrDi5jo8eBesxu3bol39+od6S01Go1YIAhRYdBo7TR/KiqH/7wh4jH42ryofjss8/UJADAJ598AgAwDEPNCvTcc88BAD766CM1C+jMVx6Px3Hnzh3ANQ+t4zi+Fc+g33ehUBhqmjo2PvTEyrKsrrJFSy6XAwYYUnRQMzMz0HU98LpqNBqIx+O4ePGimnUoRlnGhq07uIyNHgfqMfvggw8AAD/72c/ULOnUqVOyILnv+B999FEAwL1792SaW69fly+//DIAIJVKybtqUigU4DjO2MZGX19f90zdiM5dOc0pTnMok17nffbsWWiahtXV1a7ZpOr1ukx7+umnZTpNypDJZGQaOhXIoPOX//3vfwdcjXNYeNDfnK55Pz/60Y+ATkMpdzDpda2hTxmjMp1MJruC0fXr1wFX0Dtsoyxj+6k7uIyNmPrSmh0ed7cEv24LbtTgxd0gg/oMwzXggns/1IhDcw384Ub71Dqjdtl9BgY5rIYuyWRSaJ2BDzKZjPx/Oi/VIOdN+clkUti2LTKZjExT+4y696fruucYDMOQ3XrY5KHuR34NNVXU7ch9DQ9yrVF58StjtE+6ruw+A4NMShkbtu7gMjZa/E2NUa/WpirqJwklqKvDDKqtT21l+EJVsViUlYm74PndOBxWJWLbtqjVap7jMAwjsC+0GOC8a7WasFyjL6HzPavrEcdnSEY6l17nzcKNrqle1xIpFovyb6+m97rW+pWxXC7n2Z6u
LTVIiz7XWtjK2DB1h+AyNlL8TbGxGaTrSNjROUz6ebDjicvY8cTvqBkbQiwWQ7FYBAA888wzajZj7IC4jHXjQM3YkP7xj38AnQqFMTZ6XMa8OFAzNqTt7e2hurcwxobDZcwrIr5slMAYG1AkEkEmk8G1a9fULMbYCHAZ8+Jf1IwNgQZfePbZZ9UsxtgIcBnrxoGasSHcvn0bAPDNb35TzWKMjQCXsW4cqBkbwvb2duD0fYyxg+My1o0DNWMDarVaKJVKYxsGkrEHDZcxfxyoGRvQm2++CU3T8Nprr6lZjLER4DLmjwM1Y33Mz88jEolgZ2cHN2/e5EdyjI0Yl7HeuHsWY4wxFmL8i5oxxhgLMQ7UjDHGWIhxoGaMMcZCjAM1Y4wxFmIcqBljjLEQ40DNGGOMhRgHasYYYyzEOFAzxhhjIfY/hQorgLcfeDUAAAAASUVORK5CYII=\" width=\"490\" height=\"81\"\u003e\u003c/p\u003e\u003cp\u003eand for student \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e,\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\:{\\text{IFC}}_{\\text{mean}}^{\\left(i\\right)}=\\frac{1}{\\mid\\:{C}_{i}\\mid\\:}\\sum\\:_{j\\in\\:{C}_{i}}{\\text{IFC}}_{j},$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{i}\\)\u003c/span\u003e\u003c/span\u003eis the set of courses attempted by student \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003eup to the VOT cut-off.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eKey insight\u003c/strong\u003e\u003cp\u003eIFC-based features dominate the importance ranking in our empirical models, confirming that friction\u0026mdash;rather than grades alone\u0026mdash;captures critical aspects of curricular vulnerability.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eN4: Trajectory dynamics (temporal processes)\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eTheoretical grounding\u003c/strong\u003e\u003cp\u003elife-course approaches (Elder, 1998); state-transition and Markov chain perspectives on educational progression.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eN4 features encode how 
students move through time, rather than what they look like at a single snapshot:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEnrolment gaps\u003c/b\u003e: longest gap between consecutive active terms or course enrolments.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eLoad trends\u003c/b\u003e: change in course load over time, often modelled as the slope in a simple regression of courses-per-term on term number.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eState entropy\u003c/b\u003e: diversity of academic states (passed, failed, dropped, not attempted), computed as Shannon entropy\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$\\:H=-\\sum\\:_{s\\in\\:S}p\\left(s\\right){\\log}_{2}p\\left(s\\right),$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:S\\)\u003c/span\u003e\u003c/span\u003e is the set of states and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:p\\left(s\\right)\\)\u003c/span\u003e\u003c/span\u003e their empirical probabilities.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eVelocity of advance\u003c/b\u003e: ratio of completed courses to those expected by the nominal curriculum at each time point.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eKey insight\u003c/strong\u003e\u003cp\u003eN4 variables account for several of the top predictors in our feature importance analyses, underscoring that temporal structure\u0026mdash;interruptions, non-linear progress, and volatility\u0026mdash;contributes information not captured by static 
indicators.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eCross-level interactions\u003c/b\u003e\u003c/p\u003e\u003cp\u003eEducational processes are fundamentally interactive: the effect of one feature often depends on another. Rather than relying solely on hierarchical models, CAPIRE engineers theory-guided interaction terms. Examples include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN1 \u0026times; N3\u003c/b\u003e: socio-economic deprivation \u0026times; pass rate, to model how poverty amplifies the impact of academic failure.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN2 \u0026times; N4\u003c/b\u003e: age at entry \u0026times; average number of attempts, capturing heightened sensitivity of older students to repeated failure.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3 \u0026times; N3\u003c/b\u003e: combinations of friction and withdrawal rates, representing compound academic risk.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3 \u0026times; N4\u003c/b\u003e: exposure to high-IFC courses \u0026times; maximum gap, reflecting the vulnerability of interrupted trajectories in demanding curricula.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eEmpirical analyses (Section \u003cspan refid=\"Sec56\" class=\"InternalRef\"\u003e7.5.4\u003c/span\u003e) show that such interaction terms account for a substantial proportion of total model gain, indicating that multilevel thinking is not merely conceptually elegant but empirically necessary.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.3. From risk scores to trajectory archetypes\u003c/h2\u003e\u003cp\u003eMost early-warning systems produce individual-level risk scores (e.g., an estimated probability of dropout within the next year). 
While useful for prioritising outreach, these scores suffer from three limitations:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eLimited explanatory power\u003c/b\u003e: a high risk value rarely clarifies whether the underlying mechanism is academic friction, financial stress, social isolation, or misalignment of expectations.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eHomogenisation of heterogeneity\u003c/b\u003e: students with similar predicted risk may require very different forms of support.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eWeak link to practice\u003c/b\u003e: risk scores do not specify concrete, differentiated actions.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eCAPIRE shifts focus from isolated probabilities to \u003cb\u003etrajectory archetypes\u003c/b\u003e: empirically derived groups of students who share similar N1\u0026ndash;N4 profiles and, consequently, similar mechanisms of vulnerability and response to support.\u003c/p\u003e\u003cp\u003eConceptually, CAPIRE reframes the question:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003efrom \u003cem\u003e\u0026ldquo;How likely is this student to withdraw?\u0026rdquo;\u003c/em\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eto \u003cem\u003e\u0026ldquo;Which trajectory pattern is this student following, and what typically happens to students on this path?\u0026rdquo;\u003c/em\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eArchetypes are obtained through unsupervised learning in the VOT-compliant feature space, combining dimensionality reduction (UMAP) with density-based clustering (DBSCAN). 
Each archetype is then characterised along three axes:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eStructural profile\u003c/b\u003e: distributions of N1\u0026ndash;N2 features (e.g., socio-economic background, age at entry).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCurricular and friction profile\u003c/b\u003e: N3 patterns (e.g., high IFC concentrated in core mathematics courses).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTemporal profile\u003c/b\u003e: N4 dynamics (e.g., early gaps, late deceleration, high entropy).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis structure enables:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eInterpretability\u003c/b\u003e: archetypes can be summarised in natural language as recognisable patterns (e.g., \u0026ldquo;early structural overload in gateway mathematics\u0026rdquo;).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eHeterogeneity-aware risk\u003c/b\u003e: each archetype has its own attrition rate and typical progression pattern.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eActionability\u003c/b\u003e: archetypes map onto specific institutional levers (e.g., strengthened tutoring in particular courses, targeted financial advice, adjustments to curriculum sequencing).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe term \u0026ldquo;archetype\u0026rdquo; is used in a pragmatic sense: not as an essentialist label, but as a recurring configuration that simplifies complexity without erasing relevant variation.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e3.4. CAPIRE within institutional decision-making cycles\u003c/h2\u003e\u003cp\u003eCAPIRE is conceived as a sociotechnical system embedded in routine institutional processes rather than as a standalone predictive tool. 
Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e3.2\u003c/span\u003e summarises its role within decision-making cycles.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eIn a typical deployment:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eData integration and feature extraction\u003c/b\u003e: SIS and LMS data are periodically ingested, cleaned, and transformed into N1\u0026ndash;N4 features under a configured VOT.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eArchetype assignment\u003c/b\u003e: students are assigned to archetypes based on their current feature vectors, with risk and mechanism profiles updated over time.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAdvisory use\u003c/b\u003e: academic advisors, programme coordinators, or retention committees access dashboards that display archetype distributions, key features, and historical outcomes.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eIntervention design\u003c/b\u003e: archetype profiles inform targeted actions (e.g., small-group tutoring, mentoring schemes, counselling referral paths), including the intensity, timing, and modality of support.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFeedback and learning\u003c/b\u003e: outcomes of interventions feed back into subsequent training cycles, allowing institutions to monitor whether archetype distributions and associated risks change over time.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eSeveral design choices are essential for responsible integration:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eHuman-in-the-loop\u003c/b\u003e: archetype assignments are recommendations, not 
prescriptions. Staff can override or re-interpret assignments based on qualitative knowledge that lies outside administrative data.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTransparency\u003c/b\u003e: explanations are available at both group and individual level, enabling staff to understand why a student was mapped to a given archetype.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEthical framing\u003c/b\u003e: archetype labels are used internally by staff; communication with students focuses on supportive offers (e.g., \u0026ldquo;we have observed difficulties in specific courses and can provide tailored support\u0026rdquo;) rather than categorical classifications.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eIterative refinement\u003c/b\u003e: as institutional policies, curricula, and external contexts evolve, the CAPIRE pipeline can be recalibrated and re-trained, preserving alignment with local realities.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eBy anchoring analytics in interpretable archetypes, CAPIRE transforms predictive models into institutional learning tools, supporting a move from ad-hoc interventions to systematically designed, evidence-informed retention strategies.\u003c/p\u003e\u003c/div\u003e"},{"header":"4. MULTILEVEL FEATURE ENGINEERING IN CAPIRE","content":"\u003cp\u003eThis section translates the conceptual CAPIRE framework into a concrete feature dictionary: a set of 44 empirically tested variables spanning levels N1\u0026ndash;N4. We describe the underlying data model, the construction logic for each feature family, and the design criteria that allow other institutions to adapt the framework to their own contexts while preserving temporal validity and interpretability.\u003c/p\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e4.1. 
Data model and entities\u003c/h2\u003e\u003cp\u003eCAPIRE operates on a relational data model with five core entities, common to most student information systems:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSTUDENT\u003c/b\u003e: one record per individual.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eENROLMENT\u003c/b\u003e: one record per student\u0026ndash;course\u0026ndash;term combination.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCOURSE\u003c/b\u003e: curricular units with associated metadata (e.g., department, level, credits).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCURRICULUM\u003c/b\u003e: programme-specific course sequencing and recommended load.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSEMESTER/TERM\u003c/b\u003e: temporal index enabling alignment with macro-level indicators.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIn this model:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eA \u003cb\u003eSTUDENT\u003c/b\u003e has many \u003cb\u003eENROLMENT\u003c/b\u003e records.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eEach \u003cb\u003eENROLMENT\u003c/b\u003e references one \u003cb\u003eCOURSE\u003c/b\u003e.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eEach \u003cb\u003eCOURSE\u003c/b\u003e is associated with one \u003cb\u003eCURRICULUM\u003c/b\u003e.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eEach \u003cb\u003eENROLMENT\u003c/b\u003e belongs to one \u003cb\u003eSEMESTER\u003c/b\u003e, which is linked to calendar time (for N4 features).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe \u003cb\u003eoutcome variable\u003c/b\u003e is a binary attrition flag:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eattrition_flag\u0026thinsp;=\u0026thinsp;1 if the student leaves the programme without graduating 
within a predefined horizon (e.g., six years).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eattrition_flag\u0026thinsp;=\u0026thinsp;0 if the student graduates or remains enrolled at the end of the horizon.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eCrucially, this outcome is \u003cb\u003enever used as a feature\u003c/b\u003e. All predictors are computed from data that are available at or before the chosen Value of Observation Time (VOT) cut-off, ensuring that no post-hoc information leaks into the feature space.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003e4.2. N1 features: pre-entry socio-economic context\u003c/h2\u003e\u003cp\u003e\u003cb\u003ePurpose.\u003c/b\u003e N1 features capture structural conditions that shape the resources, expectations, and constraints students bring into higher education.\u003c/p\u003e\u003cp\u003eTypical N1 variables include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eNeighbourhood deprivation (e.g., NBI_localidad)\u003c/b\u003e: an index derived from census data at the postcode or census-tract level, summarising poverty, overcrowding, access to basic services, and educational infrastructure.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDistance to campus\u003c/b\u003e: geodesic or travel distance from the student\u0026rsquo;s home area to the institution, used as a proxy for commuting burden and integration costs.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eLocal labour-market indicators at entry (desempleo_zona_t0, informalidad_zona_t0, pobreza_zona_t0)\u003c/b\u003e: unemployment, informality, and poverty rates for the student\u0026rsquo;s locality in the year of enrolment.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFamily educational capital (nivel_educ_padres, hermanos_universidad)\u003c/b\u003e: parental education and whether siblings have 
attended university.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSecondary-school type (tipo_secundaria)\u003c/b\u003e: public vs. private vs. technical, as a coarse indicator of prior institutional context.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eWhere necessary, N1 features are complemented by simple interaction terms (e.g., deprivation \u0026times; pass rate) to capture how socio-economic context modulates academic outcomes.\u003c/p\u003e\u003cp\u003e\u003cb\u003eMissing data handling.\u003c/b\u003e Census-derived variables are typically complete at area level; when gaps exist, median imputation within region/province is used. For self-reported data (e.g., parental education), CAPIRE explicitly encodes missingness with separate binary indicators to avoid silently conflating \u0026ldquo;unknown\u0026rdquo; with any substantive category.\u003c/p\u003e\u003cp\u003e\u003cb\u003eAdaptation.\u003c/b\u003e Outside Country Q, analogous indices (e.g., US Census tract data, Index of Multiple Deprivation in the UK) can be substituted without altering the overall N1 logic.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003e4.3. N2 features: entry moment characteristics\u003c/h2\u003e\u003cp\u003e\u003cb\u003ePurpose.\u003c/b\u003e N2 features describe the student at, and immediately around, the point of first enrolment. 
They are conceptually and temporally anchored at \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{0}\\)\u003c/span\u003e\u003c/span\u003e (entry).\u003c/p\u003e\u003cp\u003eCore N2 variables include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAge at entry (edad_ingreso)\u003c/b\u003e: a continuous measure that differentiates traditional and non-traditional entrants.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eGender and other demographic flags\u003c/b\u003e: used primarily for fairness monitoring and descriptive analysis.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEmployment status at entry (trabaja_al_ingreso)\u003c/b\u003e: when available, indicates potential time and cognitive constraints.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eUpper-secondary performance (promedio_secundaria)\u003c/b\u003e: high-school GPA or equivalent exam score.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMacro-economic conditions at entry (IPC_interanual_t0, strikes in the 24 months prior to\u003c/b\u003e \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{0}\\)\u003c/span\u003e\u003c/span\u003e\u003cb\u003e)\u003c/b\u003e: inflation and major disruptions to schooling, aligned to the calendar year of enrolment, not averaged across cohorts.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eAll N2 features are computed using data that are, by definition, available at the moment of first enrolment. This ensures that the observation window is properly anchored and that N2 plays the role of \u003cem\u003einitial conditions\u003c/em\u003e in subsequent analyses.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e4.4. 
N3 features: academic performance and curricular friction\u003c/h2\u003e\u003cp\u003e\u003cb\u003ePurpose.\u003c/b\u003e N3 features capture how students engage with the curriculum within the VOT window. They describe both \u003cem\u003ewhat\u003c/em\u003e students have attempted and \u003cem\u003ehow\u003c/em\u003e those attempts have unfolded.\u003c/p\u003e\u003cp\u003eTypical N3 variables include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eVolume and outcomes of coursework\u003c/b\u003e:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003etotal number of courses attempted up to VOT,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003etotal passed, failed, and dropped,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003epass and failure rates within the window,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003emean and median grades, along with variability measures.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCurricular friction indicators\u003c/b\u003e:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ethe \u003cb\u003eInstructional Friction Coefficient (IFC)\u003c/b\u003e at course level, defined as a weighted combination of withdrawal and failure rates;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ethe student-level \u003cb\u003emean IFC\u003c/b\u003e across all attempted courses;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eexposure to \u0026ldquo;filter\u0026rdquo; or \u0026ldquo;gateway\u0026rdquo; subjects\u0026mdash;courses whose IFC exceeds a pre-specified threshold.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIn the FACULTY B-UNIVERSITY X case study, filter courses include high-impact mathematics and physics subjects that historically concentrate failure and withdrawal.\u003c/p\u003e\u003cp\u003eFormally, for each course 
\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:j\\)\u003c/span\u003e\u003c/span\u003e,\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$\\:{\\text{IFC}}_{j}={w}_{1}\\cdot\\:\\frac{{\\text{Dropped}}_{j}}{{\\text{Attempted}}_{j}}+{w}_{2}\\cdot\\:\\frac{{\\text{Failed}}_{j}}{{\\text{Attempted}}_{j}},$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewith default weights \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{w}_{1}=1.0\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{w}_{2}=0.5\\)\u003c/span\u003e\u003c/span\u003e, so that withdrawals are treated as a stronger signal than failures. For student \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e, the aggregated friction is\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$$\\:{\\text{IFC}}_{\\text{mean}}^{\\left(i\\right)}=\\frac{1}{\\left|{C}_{i}\\right|}\\sum_{j\\in {C}_{i}}{\\text{IFC}}_{j},$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{i}\\)\u003c/span\u003e\u003c/span\u003e comprises all courses attempted by student \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e up to the VOT cut-off.\u003c/p\u003e\u003cp\u003e\u003cb\u003eVOT compliance.\u003c/b\u003e All N3 features are computed from ENROLMENT records whose dates fall within the interval \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:[{t}_{0},{t}_{0}+{T}_{V}]\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e denotes the length of the observation window. 
Courses taken after this window are invisible to the feature extractor, even if they are present in the database, thereby preventing temporal leakage.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e4.5. N4 features: trajectory dynamics\u003c/h2\u003e\u003cp\u003e\u003cb\u003ePurpose.\u003c/b\u003e N4 features encode the temporal structure of each student\u0026rsquo;s progression. Rather than summarising only counts and averages, they describe \u003cem\u003ehow\u003c/em\u003e events are distributed over time.\u003c/p\u003e\u003cp\u003eKey N4 variables include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAverage attempts per course (intentos_promedio_ventana)\u003c/b\u003e: distinguishing between students who typically pass on first attempt and those who accumulate retries.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMaximum gap between enrolments (gap_maximo_entre_cursadas)\u003c/b\u003e: the longest period without active course participation, measured in terms or semesters.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTrend in course load (tendencia_carga)\u003c/b\u003e: the slope of a simple regression of courses-per-semester on semester index, indicating whether students accelerate, maintain, or gradually reduce load.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eVelocity of advance (velocidad_avance)\u003c/b\u003e: the ratio between completed courses and the number expected by the nominal curriculum at VOT.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eRegularity of progression (regularidad_cursado)\u003c/b\u003e: variability in the spacing of enrolments.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eState entropy (entropia_de_estados)\u003c/b\u003e: diversity in course outcomes (passed, failed, dropped, not attempted) up to VOT, computed as Shannon 
entropy\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv id=\"Equf\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equf\" name=\"EquationSource\"\u003e\n$$\\:H=-\\sum_{s\\in S}p\\left(s\\right)\\log_{2}p\\left(s\\right),$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:S\\)\u003c/span\u003e\u003c/span\u003e is the set of states and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:p\\left(s\\right)\\)\u003c/span\u003e\u003c/span\u003e their empirical frequencies.\u003c/p\u003e\u003cp\u003eHigh entropy indicates erratic trajectories (mix of passes, failures, and withdrawals), whereas low entropy reflects consistent patterns (predominantly success, or predominantly failure/withdrawal).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003e4.6. Interaction features and composite indicators\u003c/h2\u003e\u003cp\u003eEducational processes are rarely additive. The impact of academic friction depends on socio-economic context; the impact of age depends on patterns of enrolment and gaps. Linear main-effects-only models systematically miss such conditional structures.\u003c/p\u003e\u003cp\u003eCAPIRE therefore incorporates a curated set of interaction features and composites, guided by theory and validated empirically. 
Examples include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFriction \u0026times; withdrawal rate\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{IFC}}_{\\text{mean}}^{\\left(i\\right)}\\times\\:\\text{tasa_libre}\\)\u003c/span\u003e\u003c/span\u003e: captures compound academic risk when students repeatedly drop courses with high structural difficulty.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAge at entry \u0026times; attempts\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{edad_ingreso}\\times\\:\\text{intentos_promedio}\\)\u003c/span\u003e\u003c/span\u003e: models the idea that older students may be less resilient to repeated failure due to higher opportunity costs.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDeprivation \u0026times; pass rate\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{NBI_localidad}\\times\\:\\text{pass}\\ \\text{ratio}\\)\u003c/span\u003e\u003c/span\u003e: represents how poverty may amplify the consequences of academic setbacks.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFilter exposure \u0026times; maximum gap\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{exposicion_filtros}\\times\\:\\text{gap_maximo}\\)\u003c/span\u003e\u003c/span\u003e: reflects the vulnerability of students who both face demanding subjects and experience interruptions.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThese interactions are intentionally limited in number\u0026mdash;focusing on 
theoretically plausible combinations\u0026mdash;to avoid combinatorial explosion and overfitting. In our empirical models, they account for a disproportionately large share of predictive gain relative to their number, reinforcing the importance of multilevel thinking.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\u003ch2\u003e4.7. Feature dictionary and design criteria\u003c/h2\u003e\u003cp\u003eThe complete CAPIRE feature dictionary comprises 44 variables:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN1\u003c/b\u003e: 12 pre-entry features (structural and socio-economic).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN2\u003c/b\u003e: 6 entry-moment features (demographics, preparation, macro-context).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3\u003c/b\u003e: 16 curricular and performance features (including IFC-based metrics and course-specific indicators).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN4\u003c/b\u003e: 10 temporal and interaction features capturing dynamics and cross-level effects.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eFeature inclusion follows five design criteria:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTemporal validity\u003c/b\u003e: all features must be computable using only data available at or before the VOT cut-off.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTheoretical grounding\u003c/b\u003e: each feature must be linked to established retention or stratification theories (e.g., Tinto, Bourdieu, Astin).\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eActionability\u003c/b\u003e: features should inform potential interventions (e.g., high IFC flags subjects suitable for pedagogical 
redesign).\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMeasurability\u003c/b\u003e: variables must be obtainable from standard institutional systems (SIS, LMS, census or official statistics).\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eNon-redundancy\u003c/b\u003e: highly collinear candidates are pruned to maintain a compact, interpretable set.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eConversely, several commonly used variables in the EDM literature are explicitly excluded when they violate these principles\u0026mdash;most notably cumulative GPA computed post-hoc, total semesters enrolled (which is tautologically linked to attrition), or metrics requiring knowledge of end-of-trajectory outcomes.\u003c/p\u003e\u003cp\u003eBy enforcing these criteria, CAPIRE provides a feature space that is not only predictive, but also temporally honest, theoretically interpretable, and portable across institutions willing to adopt a similar multilevel, leakage-aware approach.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. EARLY OBSERVATION WINDOW (VOT) AND DATA LEAKAGE PREVENTION","content":"\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\u003ch2\u003e5.1. Defining the Value of Observation Time (VOT)\u003c/h2\u003e\u003cp\u003eThe Value of Observation Time (VOT) is a central design parameter in CAPIRE. 
Intuitively, the VOT is the latest point in a student\u0026rsquo;s trajectory at which:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eThe institution can still intervene in a meaningful way (e.g., tutoring, curricular adjustments, financial or psychosocial support), and\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eAll data used for risk profiling and archetype assignment are guaranteed to be available in a real operational setting.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eFormally, let \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:t\\)\u003c/span\u003e\u003c/span\u003e denote academic time measured in terms (or equivalent periods), and let \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:T\\)\u003c/span\u003e\u003c/span\u003e denote the end of the programme\u0026rsquo;s nominal duration. The VOT, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e, satisfies:\u003cdiv id=\"Equg\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equg\" name=\"EquationSource\"\u003e\n$$\\:0\u0026lt;{t}_{\\text{VOT}}\u0026lt;T;$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eInstitutional interventions launched at or shortly after \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e can plausibly affect completion;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eAll features \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{X}_{\\le\\:{t}_{\\text{VOT}}}\\)\u003c/span\u003e\u003c/span\u003e are computable using information recorded on or before \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIn many long-cycle programmes, a natural candidate for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e is the end of the first academic year, which frequently corresponds to a peak in vulnerability and dropout. CAPIRE does not, however, hard-code this choice. Instead, institutions select VOT based on:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eempirical attrition curves (cumulative dropout by term or credit band);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eorganisational response capacity (how quickly support services can act);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ecurricular structure (timing of gateway or high-friction subjects).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eOnce \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e is defined, the feature dictionary is partitioned into:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eVOT-admissible features\u003c/b\u003e: available at or before \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e and eligible for early-warning and archetype profiling;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003ePost-VOT features\u003c/b\u003e: potentially useful for retrospective analyses, longitudinal research, or causal evaluation, but \u003cb\u003enot\u003c/b\u003e for models claiming to operate at \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis explicit temporal boundary replaces vague formulations such as \u0026ldquo;early prediction\u0026rdquo; with a precise, auditable design constraint.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec23\" class=\"Section2\"\u003e\u003ch2\u003e5.2. Temporal slicing of trajectories and label assignment\u003c/h2\u003e\u003cp\u003eGiven a chosen VOT, CAPIRE adopts a two-axis temporal scheme:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eA \u003cb\u003etrajectory axis\u003c/b\u003e, along which features are accumulated up to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e;\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eAn \u003cb\u003eoutcome horizon\u003c/b\u003e, beyond \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e, over which completion and dropout outcomes are defined.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eFor each student \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e, we construct:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eA \u003cb\u003efeature snapshot at VOT\u003c/b\u003e:\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv id=\"Equh\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equh\" 
name=\"EquationSource\"\u003e\n$$\\:{\\mathbf{x}}_{i}^{\\left(\\text{VOT}\\right)}=f({\\text{N1}}_{i},\\,{\\text{N2}}_{i,\\le\\:{t}_{\\text{VOT}}},\\,{\\text{N3}}_{i,\\le\\:{t}_{\\text{VOT}}},\\,{\\text{N4}}_{i,\\le\\:{t}_{\\text{VOT}}}),$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:f(\\cdot\\:)\\)\u003c/span\u003e\u003c/span\u003e denotes the feature-construction rules described in Section \u003cspan refid=\"Sec13\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eA \u003cb\u003elabel\u003c/b\u003e \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{y}_{i}\\)\u003c/span\u003e\u003c/span\u003e, defined on a later interval, for example:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{y}_{i}=1\\)\u003c/span\u003e\u003c/span\u003e if the student drops out at any point before \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:T+{\\Delta\\:}\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\Delta\\:}\\)\u003c/span\u003e\u003c/span\u003e is a grace period;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{y}_{i}=0\\)\u003c/span\u003e\u003c/span\u003e if the student completes within that window; or\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ea multi-class or time-to-event label (e.g., on-time completion, delayed completion, non-completion).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eBy construction, no component of \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:{\\mathbf{x}}_{i}^{\\left(\\text{VOT}\\right)}\\)\u003c/span\u003e\u003c/span\u003e may depend on events after \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e. This constraint is enforced at two levels:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFeature engineering\u003c/b\u003e: all queries and transformations include explicit temporal conditions (e.g., \u0026ldquo;up to and including term \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e\u0026rdquo;), often implemented as time-filtered views of enrolment and assessment tables.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eModel evaluation\u003c/b\u003e: data splitting respects temporal structure. Training and test sets are separated by cohort or time, and preprocessing steps (scaling, encoding, feature selection) are fitted solely on the training partition within each fold.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eCAPIRE supports several temporal strategies:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSingle-shot early warning\u003c/b\u003e: one snapshot per student at a specific VOT (e.g., end of year 1).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eRolling-window warnings\u003c/b\u003e: repeated VOT snapshots (e.g., after each term), enabling dynamic monitoring of risk and possible transitions between archetypes.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eRetrospective trajectory analysis\u003c/b\u003e: full student\u0026ndash;term sequences with labels attached at the end of the observation period, suitable for survival or transition modelling in later CAPIRE 
work.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIn all cases, the strict separation between observation window and outcome horizon provides a clear framework for reasoning about leakage, stability, and fairness.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\u003ch2\u003e5.3. Typical leakage scenarios in dropout prediction\u003c/h2\u003e\u003cp\u003eIn the absence of explicit temporal design, leakage often enters attrition models in subtle ways. CAPIRE explicitly identifies several recurrent patterns:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eOutcome-proximal academic features\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eUsing end-of-year or end-of-programme indicators\u0026mdash;such as final GPA, total failed courses, or \u0026ldquo;ever dropped\u0026rdquo; flags\u0026mdash;as predictors in models that purport to provide early warnings. Similarly, using statistics computed over the entire trajectory (e.g., maximum consecutive inactive terms) when the prediction is supposed to occur much earlier.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTemporal aggregation without windowing\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eConstructing features such as \u0026ldquo;total enrolled terms\u0026rdquo; or \u0026ldquo;time since first enrolment\u0026rdquo; from the full record, which implicitly reveals whether the student persisted or left. 
Likewise, computing mean LMS activity across all courses ever taken and using it as an \u0026ldquo;early\u0026rdquo; predictor.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eLabel-dependent feature construction\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eCreating variables that directly encode or closely proxy the outcome, such as \u0026ldquo;difference between expected and realised completion time\u0026rdquo; or flags indicating that the student ceased enrolling before degree completion.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003ePreprocessing leakage across time or folds\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFitting scalers, encoders, or feature selectors on the entire dataset\u0026mdash;including future cohorts\u0026mdash;before splitting into training and test sets, or using target-encoding schemes that inadvertently peek at labels in the validation fold.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCohort and policy-regime leakage\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMixing cohorts that experienced different policies or macro-contexts in ways that allow models to infer outcomes from regime identifiers that are not available (or stable) at deployment for new cohorts.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eSome of these issues (e.g., incorrect scaling procedures) can be mitigated with rigorous pipeline implementation. Others\u0026mdash;especially those involving outcome-proximal features\u0026mdash;must be addressed at the feature-design and temporal-modelling level. CAPIRE is explicitly built to operate at that level.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec25\" class=\"Section2\"\u003e\u003ch2\u003e5.4. 
How CAPIRE\u0026rsquo;s feature engineering prevents leakage\u003c/h2\u003e\u003cp\u003eLeakage prevention is embedded in CAPIRE\u0026rsquo;s design through four complementary mechanisms:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTemporal eligibility tags in the feature dictionary\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eEach feature is annotated as:\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eVOT-admissible\u003c/b\u003e (eligible for early-warning and archetype models),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003epost-VOT\u003c/b\u003e (restricted to retrospective or explanatory analyses), or\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003erestricted\u003c/b\u003e (requiring special justification or anonymisation).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThis enables independent auditing of temporal legitimacy at the feature level.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003col start=\"2\"\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eExplicit VOT filters in construction rules\u003c/b\u003e: Feature formulas are written to include time bounds by design (e.g., \u0026ldquo;count of failed core courses up to and including term 2\u0026rdquo;, \u0026ldquo;velocity of advance at VOT\u0026rdquo;). 
Implementation templates (SQL, Python, R) systematically incorporate conditions such as term\u0026thinsp;\u0026lt;=\u0026thinsp;t_VOT, reducing the risk of developers inadvertently crossing the temporal boundary.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCohort- and time-aware data splitting\u003c/b\u003e: For predictive tasks, CAPIRE favours cohort-based or time-based splits\u0026mdash;training on earlier cohorts and testing on later ones\u0026mdash;over random splits. Preprocessing (scaling, encoding, feature selection) is fitted exclusively on the training partition in each fold and then applied to validation/test sets, preventing information from flowing \u0026ldquo;backwards in time\u0026rdquo;.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDesign rules forbidding outcome-proximal features at VOT\u003c/b\u003e: For early-warning models, the framework explicitly forbids:\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003euse of any grade or course outcome recorded after \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{\\text{VOT}}\\)\u003c/span\u003e\u003c/span\u003e;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003efeatures aggregating over the entire enrolment history (e.g., total failed courses, total inactive terms);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eindicators derived from final status (graduate vs. 
dropout) or closely related proxies.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eWhen such variables are valuable for retrospective explanation (e.g., for case studies or causal analyses), they are computed in clearly separated post-VOT feature sets that cannot be accidentally incorporated into VOT-based models.\u003c/p\u003e\u003cp\u003eIn addition, CAPIRE encourages \u003cb\u003ediagnostic checks\u003c/b\u003e for possible leakage, such as comparing performance under random vs. time-based splits, and benchmarking against models trained with explicitly post-VOT features. Large discrepancies in performance can signal hidden temporal leakage and trigger further inspection.\u003c/p\u003e\u003cp\u003eRather than treating leakage as an incidental implementation problem, CAPIRE elevates it to a first-order design constraint.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec26\" class=\"Section2\"\u003e\u003ch2\u003e5.5. Generalising VOT to other programmes and modalities\u003c/h2\u003e\u003cp\u003eAlthough the implementation discussed in this paper concerns a multi-year engineering programme, the VOT concept and associated design rules generalise to a wide range of educational settings:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eShort-cycle and professional programmes\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eIn two-year or shorter programmes, VOT may be defined at the end of the first major assessment block or when a given proportion of credits (e.g., 25\u0026ndash;30%) has been attempted. 
Features then focus on the earliest robust indicators of friction and pacing.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eModular, competency-based, and micro-credential systems\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eWhere progression is organised in modules or competencies rather than fixed terms, VOT can be defined as the point where a student accumulates a specified number of modules or attempts. Temporal slicing then operates over module sequences, and features summarise early module completion patterns, retries, and idle periods.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eOnline, blended, and MOOC-like environments\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eVOT may be set in terms of weeks since registration or proportion of content accessed. Leakage prevention requires excluding engagement metrics that implicitly look beyond this cut-off (e.g., final exam participation), while including early engagement signals (first-week activity, initial assessment performance).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003ePart-time and non-traditional trajectories\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFor heterogeneous pacing, VOT is better expressed in terms of attempted or completed credits rather than elapsed calendar time (e.g., \u0026ldquo;after the student has attempted 40 credits\u0026rdquo;). This avoids penalising slower but still viable trajectories.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCross-institutional or system-level analytics\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFor comparative studies, VOT can be standardised in terms of relative progression (e.g., completion of the first curricular block) rather than absolute years. 
Each institution then maps this conceptual VOT to local structures (terms, modules).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eAcross these modalities, the core logic of VOT remains unchanged:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eIdentify a point at which intervention remains meaningful;\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eRestrict the feature set to data legitimately available by that point;\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eMake these restrictions explicit and auditable.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eBy elevating VOT from an informal intuition (\u0026ldquo;early enough\u0026rdquo;) to a formal design parameter, CAPIRE provides a reusable template for building early-warning and archetype-discovery systems that are both accurate and temporally honest.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec27\" class=\"Section2\"\u003e\u003ch2\u003e5.6. Sensitivity analysis for DBSCAN noise cases\u003c/h2\u003e\u003cp\u003eIn addition to the main UMAP\u0026thinsp;+\u0026thinsp;DBSCAN clustering workflow, we conducted a sensitivity analysis of the cases labelled as \u003cem\u003enoise\u003c/em\u003e (cluster\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;1) by DBSCAN. This was motivated by a known limitation of density-based clustering methods: sparse but meaningful minority structures may be incorrectly classified as noise in low-dimensional embeddings.\u003c/p\u003e\u003cp\u003eThe analysis proceeded in two stages. First, we compared outlier students with non-outlier students across a set of theoretically grounded N2\u0026ndash;N4 indicators (e.g., age at entry, VOT-window mean IFC, maximum gap between enrolments) using descriptive statistics and non-parametric tests (Mann\u0026ndash;Whitney U and Levene\u0026rsquo;s tests). 
This allowed us to assess whether the noise group exhibited high internal heterogeneity (as would be expected for genuine noise) or instead formed a coherent pattern.\u003c/p\u003e\u003cp\u003eSecond, we performed dedicated re-clustering of the outlier subset using algorithms such as \u003cem\u003ek\u003c/em\u003e-means, hierarchical clustering, and HDBSCAN. The results (reported in detail in Section \u003cspan refid=\"Sec51\" class=\"InternalRef\"\u003e7.4.5\u003c/span\u003e) show that the DBSCAN noise group contains at least two well-separated minority configurations with high internal cohesion, contradicting the interpretation of these cases as unstructured residuals.\u003c/p\u003e\u003cp\u003eThis sensitivity analysis strengthens the transparency and ecological validity of the clustering pipeline. It demonstrates that CAPIRE does not simply discard a quarter of the cohort as opaque noise but explicitly documents and interrogates the structure of these cases, responding directly to peer-review concerns about representativeness and coverage.\u003c/p\u003e\u003c/div\u003e"},{"header":"6. IMPLEMENTATION AND PIPELINE ARCHITECTURE","content":"\u003cp\u003eCAPIRE is implemented as a modular, reproducible pipeline with strict temporal validation, designed to transform heterogeneous institutional data into feature matrices ready for topological analysis, archetype discovery, and predictive modelling. The architecture prioritises three fundamental principles: reproducibility, traceability, and absolute prevention of data leakage. The result is a system capable of operating from ad-hoc analytical environments to automated, institution-scale deployments.\u003c/p\u003e\u003cdiv id=\"Sec29\" class=\"Section2\"\u003e\u003ch2\u003e6.1. Overview of CAPIRE-Core Architecture\u003c/h2\u003e\u003cp\u003eThe architecture of CAPIRE-core follows a separation of responsibilities pattern, where each module operates independently and is verifiable. 
The pipeline is organised into four macro-layers:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eConfiguration Layer\u003c/b\u003e: Centralises all analytical decisions outside the code. Defines temporal window parameters, activation of feature levels, validation rules, imputation strategies, and weights for synthetic indices.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eData Ingestion \u0026amp; Validation\u003c/b\u003e: Establishes connectors capable of extracting data from SIS, LMS, administrative files, and macroeconomic sources. Each dataset undergoes structural, referential, and temporal validation before entering the pipeline.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFeature Engineering Layer\u003c/b\u003e: Implements extractors by level (N1\u0026ndash;N4). Each extractor applies VOT, generates derived transformations, and ensures that no attribute uses information after the temporal cutoff point.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAssembly \u0026amp; Metadata Layer\u003c/b\u003e: Consolidates the final set of features, generates standardised artefacts (Parquet matrices, dictionaries, JSON sidecars), and documents each feature matrix with configuration hashes and temporal audit trails.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eArchitectural principles\u003c/b\u003e:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eIdempotence\u003c/b\u003e: Any execution with the same configuration produces exactly the same results.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eModularity\u003c/b\u003e: The N1\u0026ndash;N4 extractors function as decoupled blocks; institutions without census data, for example, can deactivate N1 without affecting the rest of the pipeline.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTraceability\u003c/b\u003e: 
Each artefact is versioned with its complete configuration and cryptographic hash.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEarly validation\u003c/b\u003e: Errors are detected at entry, never during modelling.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e6.1\u003c/span\u003e summarises this modular design and the connections between layers.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec30\" class=\"Section2\"\u003e\u003ch2\u003e6.2. From Raw Data to Feature Matrices: ETL Workflow\u003c/h2\u003e\u003cp\u003eThe CAPIRE ETL pipeline follows a deterministic flow, composed of four critical stages:\u003c/p\u003e\u003cp\u003e\u003cb\u003eStage 1: Data Ingestion (Extract)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eInstitutional systems often store information in heterogeneous schemas (SQL databases, CSV exports, Excel spreadsheets). For this, CAPIRE-Core includes specific connectors that:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003estandardise field names;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003enormalise date formats, identifiers, and postcodes;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003elink student databases with census or macroeconomic data via geographic or temporal keys.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe result is fully normalised, consistent, and comparable datasets across cohorts.\u003c/p\u003e\u003cp\u003e\u003cb\u003eStage 2: Preprocessing \u0026amp; Validation (Transform)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAll information undergoes strict validation of:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003edata types (numeric, date, categorical),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eplausible ranges (age, grades, rates),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ereferential integrity (every 
enrolment must have a valid student),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003etemporal consistency (no event can occur before the student\u0026rsquo;s entry),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ecompleteness rules (essential fields cannot be missing).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIf validation fails, the pipeline halts and generates an error report.\u003c/p\u003e\u003cp\u003eMissing data management follows differentiated strategies for each level:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN1\u003c/b\u003e: geographic imputation\u0026thinsp;+\u0026thinsp;absence indicators,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN2\u003c/b\u003e: mean imputation\u0026thinsp;+\u0026thinsp;indicators,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3\u003c/b\u003e: never imputed (absence is informative),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN4\u003c/b\u003e: temporal interpolation where appropriate.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eStage 3: Feature Engineering (Transform)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eEach level of the multilevel model has a dedicated extractor:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN1\u003c/b\u003e: socioeconomic and demographic context;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN2\u003c/b\u003e: self-declared and transversal attributes;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3\u003c/b\u003e: academic behavioural footprints (core of the framework);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN4\u003c/b\u003e: macroeconomic and cohort conditions.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eVOT enforcement is integrated into each extractor: no feature may use data after the 
defined temporal window. This prevents post-VOT information from entering the feature matrix by construction, even when developers add new variables.\u003c/p\u003e\u003cp\u003eStage 3 also computes explicit interactions between levels (e.g., NBI \u0026times; pass rate) and derives synthetic indices such as the IFC.\u003c/p\u003e\u003cp\u003e\u003cb\u003eStage 4: Feature Matrix Assembly (Load)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe pipeline output is a compressed feature matrix in columnar format, accompanied by a metadata file documenting:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003enumber of features,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003epercentage of missing values,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eextraction configurations,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003efull configuration hash,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eexact execution timestamp,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eincluded cohorts and final sample size.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis mechanism makes it possible to reproduce any historical matrix with bit-level precision.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec31\" class=\"Section2\"\u003e\u003ch2\u003e6.3. 
Reproducibility and Configuration Management\u003c/h2\u003e\u003cp\u003eCAPIRE-Core requires that all analytical decisions be external to the code and audited via:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eYAML configuration files (define pipeline parameters),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eJSON validation schemas (define structural rules),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eSHA-256 cryptographic hashes (uniquely identify each configuration).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe combination of these three elements ensures that:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eany pipeline can be regenerated.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eany analytical error can be traced to its specific configuration.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003einstitutional teams work with a versioned, auditable, and comparable system across years.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe system makes CAPIRE a scientifically robust and standardised tool, aligned with the reproducibility requirements of Q1 journals.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec32\" class=\"Section2\"\u003e\u003ch2\u003e6.4. Computational Considerations and Scalability\u003c/h2\u003e\u003cp\u003eThe pipeline is designed to be efficient on modest hardware and scalable in institutional environments. 
Medium-sized datasets (\u0026asymp;\u0026thinsp;1,300 students) are fully processed in less than a minute.\u003c/p\u003e\u003cp\u003eIt scales almost linearly with the number of students and courses per student thanks to:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ebatch processing,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ecolumnar reading,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eoptional parallelisation by student,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eoptimised IFC calculations.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eFor large institutions (\u0026gt;\u0026thinsp;100,000 students), parallel processing and columnar storage are recommended. The pipeline can be integrated with distributed systems if the institution has greater infrastructure.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec33\" class=\"Section2\"\u003e\u003ch2\u003e6.5. Deployment Modes\u003c/h2\u003e\u003cp\u003eCAPIRE-Core supports three deployment modes, according to the institution\u0026rsquo;s technological maturity:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMode 1 \u0026mdash; Batch Processing (Entry-Level)\u003c/b\u003e:\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eAnalysts run the pipeline manually at the end of the semester.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eIdeal for planning offices with minimal infrastructure.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMode 2 \u0026mdash; Scheduled Automation (Intermediate)\u003c/b\u003e:\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe pipeline runs on a scheduled basis (e.g., weekly), with direct access to SIS/LMS.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eUsed for quarterly early warning systems.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMode 3 \u0026mdash; Real-Time Integration 
(Advanced)\u003c/b\u003e:\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eCAPIRE operates as a microservice queried by the student information system.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eProvides the archetype, risk estimate, and recommendation in real time when the student\u0026rsquo;s record is opened.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eCurrent situation: FACULTY B\u0026ndash;UNIVERSITY X operates in Mode 2, with migration to Mode 3 planned for 2026.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec34\" class=\"Section2\"\u003e\u003ch2\u003e6.6. Quality Assurance and Testing\u003c/h2\u003e\u003cp\u003eQuality assurance follows a pyramidal approach:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eUnit tests\u003c/b\u003e: Verify internal calculations of extractors.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eIntegration tests\u003c/b\u003e: Ensure the complete pipeline produces valid matrices.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eValidation tests\u003c/b\u003e: Confirm strict compliance with VOT.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eRegression tests\u003c/b\u003e: Compare new matrices with historical matrices to ensure reproducibility.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe critical test is temporal validation, which verifies that no feature uses data after the cutoff. Tests are run automatically via continuous integration.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec35\" class=\"Section2\"\u003e\u003ch2\u003e6.7. 
Software Availability and Licensing\u003c/h2\u003e\u003cp\u003eCAPIRE-core will be released as open software under the MIT licence, with:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003epublic repository,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ecomplete documentation and tutorials,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003esynthetic dataset for validation,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eextensible modules for new feature types and new connectors.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eInstitutional contribution is encouraged to extend the ecosystem, especially in:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003especific extractors for online modalities,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003enew LMS connectors (Canvas, Blackboard),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eadvanced behavioural features,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003elongitudinal monitoring pipelines.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"7. EMPIRICAL ILLUSTRATION: STUDENT TRAJECTORY ARCHETYPES AT UNIVERSITY X","content":"\u003cdiv id=\"Sec37\" class=\"Section2\"\u003e\u003ch2\u003e7.1. Institutional Context and Dataset\u003c/h2\u003e\u003cdiv id=\"Sec38\" class=\"Section3\"\u003e\u003ch2\u003e7.1.1. Institutional Setting\u003c/h2\u003e\u003cp\u003eThe empirical illustration of CAPIRE was conducted at the Facultad de Ciencias Exactas y Tecnolog\u0026iacute;a of Universidad Nacional de Region Z (FACULTY B-UNIVERSITY X), a public engineering school in northwest Country Q. 
FACULTY B-UNIVERSITY X offers six undergraduate engineering programs; this study focuses on Civil Engineering, a traditional program characterized by:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSequential prerequisites\u003c/b\u003e: Long chains of dependent courses in which progress in advanced subjects is strictly conditioned on completion of foundational courses.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eHigh mathematical rigor\u003c/b\u003e: First-year subjects such as Calculus I\u0026ndash;III, Physics I\u0026ndash;II and Linear Algebra act as \u0026ldquo;filter courses\u0026rdquo; with historically high failure and withdrawal rates.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSocioeconomically diverse intake\u003c/b\u003e: Most students come from middle- and lower-income households; between 35% and 40% work while studying.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eOpen admission\u003c/b\u003e: In line with many Latin American public universities, there is no entrance examination; all high-school graduates are admitted.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThese features make FACULTY B-UNIVERSITY X broadly representative of public engineering institutions in Latin America facing structural challenges in retention and time-to-degree (Giovagnoli, 2002; Garc\u0026iacute;a de Fanelli, 2014).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec39\" class=\"Section3\"\u003e\u003ch2\u003e7.1.2. Dataset Description\u003c/h2\u003e\u003cp\u003eThe empirical sample comprises \u003cb\u003e1,343 Civil Engineering students\u003c/b\u003e from the 2004\u0026ndash;2019 cohorts, covering 15 academic years. 
The analytical dataset integrates the four CAPIRE levels:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN1 \u0026ndash; Pre-entry structural context\u003c/b\u003e: demographic variables (age at enrolment, place of origin), postal-code\u0026ndash;linked neighborhood deprivation indices, and local labour market indicators.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN2 \u0026ndash; Entry moment\u003c/b\u003e: high-school GPA, employment status at enrolment, and prior educational trajectory.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3 \u0026ndash; Academic performance\u003c/b\u003e: course enrolments, pass/fail outcomes and exam attempts during the observation window.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN4 \u0026ndash; Trajectory dynamics\u003c/b\u003e: temporal ordering of course attempts, gaps between enrolments and changes in course load over time.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe \u003cb\u003eValue of Observation Time (VOT)\u003c/b\u003e was set to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}=1.5\\:\\)\u003c/span\u003e\u003c/span\u003eyears (end of the second academic year). All features were constructed using only information available up to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e, in strict compliance with the leakage-prevention principles described earlier. Full-trajectory outcomes (attrition vs. graduation, time-to-degree) were reserved solely for ex-post evaluation and were never used in feature construction.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec40\" class=\"Section3\"\u003e\u003ch2\u003e7.1.3. 
Descriptive Statistics\u003c/h2\u003e\u003cp\u003eAt \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e, the average age at enrolment was 18.7 years (SD\u0026thinsp;=\u0026thinsp;1.7), with women representing 18.2% of the sample and 3.9% of students reporting employment at entry. Trajectories in the first 1.5 years are already fragile: students attempt close to the nominal first-year course load, but a large fraction of attempts end in failure or \u0026ldquo;libre\u0026rdquo; (dropping the course without taking the exam). Over the full trajectory, the \u003cb\u003eattrition rate reaches 56.7%\u003c/b\u003e, the \u003cb\u003egraduation rate 14.8%\u003c/b\u003e, and the mean time-to-degree is 7.2 years, substantially exceeding nominal program length.\u003c/p\u003e\u003cp\u003eMissing data are concentrated in two blocks:\u003c/p\u003e\u003cp\u003e(1) macro-economic indicators for rural postal codes (\u0026asymp;\u0026thinsp;28% missing), and\u003c/p\u003e\u003cp\u003e(2) grade-based metrics for students who drop all courses without sitting exams (\u0026asymp;\u0026thinsp;42% missing in those variables).\u003c/p\u003e\u003cp\u003eWe combined median imputation for selected N1 features, exclusion of grade-based variables from the clustering step, and explicit missingness indicators. Missingness pattern analysis (Little\u0026rsquo;s test) did not detect systematic associations between missingness and attrition, supporting the assumption that missingness does not bias the archetype discovery.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec41\" class=\"Section2\"\u003e\u003ch2\u003e7.2. 
Feature Engineering Implementation\u003c/h2\u003e\u003cp\u003eThe CAPIRE multilevel feature dictionary (Section \u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003e3\u003c/span\u003e) was operationalized to produce \u003cb\u003e44 features\u003c/b\u003e grouped across four levels.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN1 \u0026ndash; Structural context (12 features).\u003c/b\u003e These variables capture socioeconomic vulnerability via a neighborhood deprivation index (NBI), local unemployment and informality rates at enrolment, and indicators of macro-economic crisis periods. Interaction terms such as \u003cem\u003eNBI \u0026times; pass rate\u003c/em\u003e link structural disadvantage to observed performance.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN2 \u0026ndash; Entry moment (6 features).\u003c/b\u003e Features include age at enrolment, employment status, geographic origin (rural vs. urban; distance to campus), and temporally aligned educational and economic context (e.g., number of teacher strikes in the 24 months preceding enrolment, inflation at t₀).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3 \u0026ndash; Academic performance snapshot (16 features).\u003c/b\u003e Up to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e, we summarize the academic record using counts of failed courses, proportion of \u0026ldquo;libre\u0026rdquo; enrolments, mean and median grades, and variability of performance. 
A central construct is the \u003cb\u003eInstructional Friction Coefficient (IFC)\u003c/b\u003e, which quantifies course-level structural difficulty by combining failure and withdrawal rates and allows identification of institutional \u0026ldquo;chokepoint\u0026rdquo; courses.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN4 \u0026ndash; Trajectory dynamics (10 features).\u003c/b\u003e These variables describe temporal patterns such as the maximum gap between consecutive enrolments, the trend in course load across semesters, the ratio of completed to expected courses at \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e, several cross-level interaction terms (e.g., friction \u0026times; dropout, age \u0026times; re-enrolment) and an entropy-like index capturing how erratic or consistent the sequence of states (passed/failed/dropped/not attempted) is over time.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eAll features strictly respect \u003cb\u003eVOT compliance\u003c/b\u003e: no variable uses information beyond 1.5 years after enrolment; macro-indicators are aligned with the year of entry; and the attrition label is never used for feature construction. Configurations are versioned so that the exact feature set can be regenerated.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec42\" class=\"Section2\"\u003e\u003ch2\u003e7.3. Archetype Discovery Results\u003c/h2\u003e\u003cdiv id=\"Sec43\" class=\"Section3\"\u003e\u003ch2\u003e7.3.1. Dimensionality Reduction and Clustering\u003c/h2\u003e\u003cp\u003eGiven the 44-dimensional feature space, we first applied \u003cb\u003eUniform Manifold Approximation and Projection (UMAP)\u003c/b\u003e to obtain a three-dimensional embedding that preserves local structure while facilitating clustering. 
The resulting representation captures slightly more than half of the total variance and provides a well-separated manifold suitable for density-based clustering.\u003c/p\u003e\u003cp\u003eWe experimented with \u003cb\u003eMapper-based TDA\u003c/b\u003e using multiple lenses and cover parameters, but Mapper consistently produced dozens of micro-clusters, many too small to support institutional interventions. This mismatch reflects a tension between fine-grained topological exploration and the need for a limited number of robust, interpretable types. We therefore adopted a more pragmatic strategy: \u003cb\u003eDBSCAN\u003c/b\u003e applied directly to the UMAP embedding.\u003c/p\u003e\u003cp\u003eDBSCAN hyperparameters were tuned using k-distance plots and cluster validity indices. The final solution yielded \u003cb\u003e18 clusters\u003c/b\u003e, of which \u003cb\u003e13 met our interpretability criterion (\u0026ge;\u0026thinsp;40 students)\u003c/b\u003e and were retained as archetypes. Smaller clusters were merged with density-labelled noise for analysis. Overall, \u003cb\u003e847 students (63.1%)\u003c/b\u003e received a stable archetype label; 356 (26.5%) were classified as noise; and 140 (10.4%) belonged to small clusters merged into the residual group.\u003c/p\u003e\u003cp\u003eCluster validity was acceptable for a heterogeneous educational dataset: the silhouette coefficient was 0.318, the Calinski\u0026ndash;Harabasz index 590.4 and the Davies\u0026ndash;Bouldin index 0.702, all consistent with distinguishable but partially overlapping clusters in a complex social system.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec44\" class=\"Section3\"\u003e\u003ch2\u003e7.3.2. Archetype Characterization\u003c/h2\u003e\u003cp\u003eEach archetype was profiled using descriptive statistics of the 44 features plus full-trajectory outcomes (attrition and graduation). Table\u0026nbsp;7.1 (not reproduced here in full) summarizes the five largest archetypes. 
Key patterns include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eArquetipo 5 \u0026ndash; High-Risk: Sustained Friction.\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eStudents with high and persistent curricular friction: around three failed or dropped courses within \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e, dropout rates near 75%, and IFC values among the highest across Q1\u0026ndash;Q4. Attrition reaches 74.3%, with very low graduation. These students are structurally embedded in \u0026ldquo;chokepoint\u0026rdquo; courses and require intensive academic support.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eArquetipo 2 \u0026ndash; Moderate-Risk: Extra-Academic Factors.\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eStudents with relatively low friction (low \u0026ldquo;libre\u0026rdquo; proportion and moderate failure rates) but still high attrition (\u0026asymp;\u0026thinsp;59%). Their trajectories suggest that withdrawal is driven less by academic failure and more by unobserved extra-academic pressures (financial stress, health, family obligations), indicating the need for counseling and social support rather than purely curricular interventions.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eArquetipo 9 \u0026ndash; Critical-Risk: Total Disengagement.\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eStudents whose entire first-year record consists of dropped courses (100% \u0026ldquo;libre\u0026rdquo;) and virtually no exams taken. Attrition exceeds 80%. 
These students never establish an academic foothold and would benefit from pre-enrolment orientation, realistic expectation-setting and first-weeks intensive support.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eArquetipo 16 \u0026ndash; Low-Risk: Success Model.\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eStudents with consistently low friction, high pass rates, no significant gaps and early completion. Attrition is about 21% and graduation above 27%. They represent \u0026ldquo;success trajectories\u0026rdquo; and are natural candidates for peer-mentoring roles and for defining normative curricular benchmarks.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eArquetipo 0 \u0026ndash; Moderate-Risk: Young Strivers.\u003c/b\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe youngest group on average, with no employment at entry but high friction in early courses and elevated attrition (\u0026asymp;\u0026thinsp;66%). 
They appear academically motivated but underprepared for the level of rigor, suggesting the value of bridge programs and explicit training in study strategies.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 7.1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eSummary of the five largest archetypes (N\u0026thinsp;=\u0026thinsp;1,343).\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eArchetype ID\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eArchetype Label\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eN1\u0026ndash;N2 Profile\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eN3 Friction Pattern\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eN4 Trajectory Pattern\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eAttrition Rate (%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eArquetipo_5\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eHigh-Risk: Early Performance Collapse\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eLower-middle SES; standard entry; age slightly above average\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eLow initial performance; high \u0026ldquo;libre\u0026rdquo; rate; strong dependence on foundational courses\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eUnstable trajectory; early course repetition; persistent risk\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e74.3%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eArquetipo_9\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eModerate-Risk: Low GPA\u0026thinsp;+\u0026thinsp;Course Friction\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eIntermediate SES; traditional entry\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eLow initial performance; high proportion of failed courses\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eModerate oscillations; slow progression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e84.5%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eArquetipo_8\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eLow-Middle 
SES\u0026thinsp;+\u0026thinsp;Mixed Performance\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eLow SES; early entry; young age\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eHeterogeneous performance; mix of passed and \u0026ldquo;libre\u0026rdquo; courses\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eZigzagging but non-critical trajectory\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e64.1%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eArquetipo_0\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eAdult Entrants with High Friction\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAge well above average; frequently employed\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eLow grades; difficulties in the initial stages\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eFragmented trajectory; recurrent interruptions\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e66.1%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eArquetipo_11\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eModerate-Risk: High Course Load\u0026thinsp;+\u0026thinsp;Low Success\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMiddle SES; typical student\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eHigh rate of courses taken with 
poor performance\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSlow progression with accumulating academic debt\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e71\u0026ndash;72%\u003c/b\u003e (based on z-score\u0026thinsp;+\u0026thinsp;estimate)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eA heatmap of standardized features across archetypes (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e7.5\u003c/span\u003e) highlights sharp contrasts: for example, the extreme \u0026ldquo;libre\u0026rdquo; rates of Arquetipo 9 and the low IFC and high progress indicators of Arquetipo 16.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec45\" class=\"Section3\"\u003e\u003ch2\u003e7.3.3. Filter Subjects and Curricular Friction\u003c/h2\u003e\u003cp\u003eAt the course level, we computed the \u003cb\u003eInstructional Friction Coefficient\u003c/b\u003e across Q1\u0026ndash;Q4. The top ten \u0026ldquo;filter subjects\u0026rdquo; include advanced structural mechanics, hydrology, basic hydraulics, pavement design, upper-level calculus, statistics and key materials courses. Civil Engineering subjects dominate the friction ranking, with mathematics courses acting as cross-program barriers.\u003c/p\u003e\u003cp\u003eFrom an institutional perspective, this confirms that attrition is not purely idiosyncratic: specific curricular components systematically generate friction. CAPIRE provides a quantitative map of those bottlenecks, which can be used to prioritize pedagogical redesign (e.g., active learning, peer-assisted instruction, changes in prerequisite structures).\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec46\" class=\"Section2\"\u003e\u003ch2\u003e7.4. 
Archetype Validation\u003c/h2\u003e\u003cp\u003eTo ensure that the 13 archetypes represent genuine and robust patterns, we conducted several complementary validation analyses.\u003c/p\u003e\u003cdiv id=\"Sec47\" class=\"Section3\"\u003e\u003ch2\u003e7.4.1. Bootstrap Stability\u003c/h2\u003e\u003cp\u003eUsing 100 bootstrap resamples of the original dataset, we re-estimated the full UMAP\u0026thinsp;+\u0026thinsp;DBSCAN pipeline and compared cluster assignments via the Adjusted Rand Index (ARI). The mean ARI was 0.614 (SD\u0026thinsp;=\u0026thinsp;0.081; 95% CI [0.444, 0.780]), indicating \u003cb\u003esubstantial stability\u003c/b\u003e in cluster structure despite sampling variability\u0026mdash;particularly remarkable given the heterogeneity typical of student trajectories.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec48\" class=\"Section3\"\u003e\u003ch2\u003e7.4.2. Permutation Significance Test\u003c/h2\u003e\u003cp\u003eTo test whether the observed clustering outperforms random partitions, we built a null distribution of silhouette scores from 100 random permutations of cluster labels. The real silhouette score (0.318) was far above the null mean (\u0026minus;\u0026thinsp;0.122), with an empirical p-value of 0.0099. Thus, the observed clusters are \u003cb\u003ehighly unlikely\u003c/b\u003e to arise by chance (p\u0026thinsp;\u0026lt;\u0026thinsp;0.01).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec49\" class=\"Section3\"\u003e\u003ch2\u003e7.4.3. Temporal Validation Across Cohorts\u003c/h2\u003e\u003cp\u003eWe assessed temporal stability by splitting the sample into two independent periods (2004\u0026ndash;2010 and 2011\u0026ndash;2019), projecting both through the same UMAP embedding and comparing archetype distributions and attrition rates. Differences in attrition per archetype were consistently below 5 percentage points. 
The overall attrition rate decreased modestly in later cohorts, likely reflecting institutional policies, but the \u003cb\u003erelative profiles and risks of each archetype remained stable\u003c/b\u003e. This supports the interpretation of archetypes as persistent structural patterns rather than cohort-specific artefacts.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec50\" class=\"Section3\"\u003e\u003ch2\u003e7.4.4. Sensitivity to Hyperparameters\u003c/h2\u003e\u003cp\u003eWe explored the sensitivity of the archetypes to variations in UMAP and DBSCAN hyperparameters via a small grid of alternative configurations. Across 27 combinations, the ARI relative to the reference clustering averaged 0.74, with a minimum of 0.62 and a maximum of 0.89. This indicates that archetype structure is \u003cb\u003erobust to reasonable changes in modelling choices\u003c/b\u003e and is not an artefact of a particular parameter setting.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec51\" class=\"Section3\"\u003e\u003ch2\u003e7.4.5. Analysis of DBSCAN \u0026ldquo;Noise\u0026rdquo;\u003c/h2\u003e\u003cp\u003eBecause DBSCAN labels a substantial fraction of students (26.5%) as noise, we analysed this group separately. Compared with clustered students, outliers had almost identical mean age and friction but \u003cb\u003eshorter gaps between enrolments and lower variance\u003c/b\u003e in the analysed variables. Non-parametric tests (Mann\u0026ndash;Whitney and Levene) confirmed statistically significant differences in distributions and lower dispersion among outliers.\u003c/p\u003e\u003cp\u003eRe-clustering only the outliers revealed at least two clearly separated micro-archetypes with high silhouette scores, and additional smaller groups under HDBSCAN. This suggests that the \u0026ldquo;noise\u0026rdquo; does not constitute random chaos but rather \u003cb\u003ecohesive minority trajectories\u003c/b\u003e that are not dense enough to form DBSCAN clusters. 
These residual structures deserve explicit modelling in future work.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec52\" class=\"Section2\"\u003e\u003ch2\u003e7.5. Predictive Performance: Early-Warning System\u003c/h2\u003e\u003cdiv id=\"Sec53\" class=\"Section3\"\u003e\u003ch2\u003e7.5.1. Model Development\u003c/h2\u003e\u003cp\u003eTo translate archetypes into an operational early-warning system, we trained a \u003cb\u003emulticlass classifier\u003c/b\u003e to predict archetype membership at \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}=1.5\\)\u003c/span\u003e\u003c/span\u003e years. The model uses only the feature set available at \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e, with 13 classes (one per valid archetype) and 847 labelled students (those assigned to archetypes). Students outside the 13 archetypes (DBSCAN noise and small clusters) were excluded from training to avoid conflating majority patterns with minority residuals.\u003c/p\u003e\u003cp\u003eA Random Forest model, tuned via stratified cross-validation, provided the best balance between accuracy and interpretability. The train\u0026ndash;test split (70/30) preserved the proportion of each archetype.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec54\" class=\"Section3\"\u003e\u003ch2\u003e7.5.2. 
Overall Performance\u003c/h2\u003e\u003cp\u003eOn the held-out test set, the model achieved:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAccuracy\u003c/b\u003e: 94.9% (95.7% on training; 94.1% \u0026plusmn; 1.4% in cross-validation),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMacro F1-score\u003c/b\u003e: 0.948,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSmall train\u0026ndash;test gap\u003c/b\u003e, indicating minimal overfitting.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eCompared to baselines, performance is substantially higher: the majority-class baseline would reach only 8.1% accuracy, and random assignment\u0026thinsp;\u0026asymp;\u0026thinsp;7.7%. The CAPIRE-based classifier thus improves predictive power by more than an order of magnitude, using only information available within the first 1.5 years\u0026mdash;well before the average dropout time of 2.8 years.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec55\" class=\"Section3\"\u003e\u003ch2\u003e7.5.3. Per-Archetype Performance\u003c/h2\u003e\u003cp\u003ePer-class F1-scores are uniformly high. High-risk archetypes (e.g., Arquetipo 1, 5 and 9) achieve F1\u0026thinsp;\u0026gt;\u0026thinsp;0.95, enabling reliable targeting of the most vulnerable students. The \u0026ldquo;success model\u0026rdquo; archetype (16) is also classified with perfect or near-perfect accuracy, making it feasible to systematically recruit exemplary students as mentors. Moderately risky archetypes show slightly lower but still strong performance (F1\u0026thinsp;\u0026asymp;\u0026thinsp;0.88\u0026ndash;0.90), with confusions primarily between adjacent risk profiles rather than between high- and low-risk groups. No archetype falls below F1\u0026thinsp;=\u0026thinsp;0.70.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec56\" class=\"Section3\"\u003e\u003ch2\u003e7.5.4. 
Feature Importance\u003c/h2\u003e\u003cp\u003eAn analysis of feature importance confirms the \u003cb\u003emultilevel nature\u003c/b\u003e of attrition mechanisms. The most predictive variables are:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ecross-level interactions such as \u003cem\u003ecurricular friction \u0026times; dropout rate\u003c/em\u003e and \u003cem\u003eage \u0026times; re-enrolment attempts\u003c/em\u003e;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003efriction metrics in foundational courses;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ethe proportion of dropped courses;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003etrajectory-level indicators such as entropy of states, re-enrolment frequency and maximum gaps.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eNotably, purely structural N1 variables rarely appear among the top predictors. Their effect appears to be mediated through N2\u0026ndash;N4 variables (e.g., socioeconomic disadvantage \u0026rarr; need to work \u0026rarr; higher \u0026ldquo;libre\u0026rdquo; rates and gaps). This aligns with the CAPIRE hypothesis that structural vulnerability operates \u003cb\u003ethrough\u003c/b\u003e behavioural and temporal mechanisms rather than as a direct determinant.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec57\" class=\"Section3\"\u003e\u003ch2\u003e7.5.5. 
Deployment and Impact Projection\u003c/h2\u003e\u003cp\u003eThe trained classifier can be deployed as a back-end service in the institutional information system, assigning archetypes to students as soon as they reach \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e and triggering pre-defined, archetype-specific interventions (Section \u003cspan refid=\"Sec58\" class=\"InternalRef\"\u003e7.6\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eA simple cost\u0026ndash;benefit projection, assuming modest reductions in attrition (10\u0026ndash;15 percentage points) for the most critical archetypes, suggests that targeted interventions could retain around \u003cb\u003e20 additional students per year\u003c/b\u003e. Over five years, this corresponds to roughly \u003cb\u003e100 additional graduates\u003c/b\u003e, increasing the overall graduation rate by approximately 13% relative to the baseline. Even under conservative assumptions about intervention costs and retained tuition, the net financial impact is projected to be positive, in addition to reputational and social benefits.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec58\" class=\"Section2\"\u003e\u003ch2\u003e7.6. Institutional Interpretability and Actionable Insights\u003c/h2\u003e\u003cdiv id=\"Sec59\" class=\"Section3\"\u003e\u003ch2\u003e7.6.1. Representative Case Studies\u003c/h2\u003e\u003cp\u003eTo bridge statistical results with institutional experience, we constructed de-identified case vignettes for selected archetypes.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eA student in \u003cb\u003eArquetipo 5\u003c/b\u003e exhibits repeated failures and withdrawals in filter subjects (Calculus, Physics), maintains enrolment for several semesters and then drops out. 
The profile combines moderate socioeconomic stress, part-time work and structurally high friction, pointing to early tutoring plus financial aid as plausible interventions.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eA student in \u003cb\u003eArquetipo 16\u003c/b\u003e progresses linearly, passes all foundational courses on first attempt and graduates within six years. This trajectory exemplifies a success model, suggesting that such students can be systematically recruited as peer mentors and that their strategies can inform institutional best-practice guidelines.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eA student in \u003cb\u003eArquetipo 2\u003c/b\u003e shows acceptable academic performance but withdraws following a family health crisis. Here, the data reveal a missed opportunity: the student was viable academically but lacked support in coping with life events. This points to the need for proactive counseling and emergency aid linked to sudden gaps in enrolment.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThese vignettes were presented to academic advisors and department heads, who consistently recognized the profiles and associated them with familiar categories (\u0026ldquo;chronic repeaters\u0026rdquo;, \u0026ldquo;good students lost to personal issues\u0026rdquo;, \u0026ldquo;exemplary students\u0026rdquo;). No archetype contradicted institutional experience, suggesting that CAPIRE\u0026rsquo;s data-driven segmentation is \u003cb\u003eecologically valid\u003c/b\u003e and complements practitioner knowledge.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec60\" class=\"Section3\"\u003e\u003ch2\u003e7.6.2. Archetype-Specific Interventions\u003c/h2\u003e\u003cp\u003eBuilding on archetype profiles, we elaborated an \u003cb\u003eintervention matrix\u003c/b\u003e that links each archetype to a priority level, dominant vulnerability and recommended institutional response. 
Critical-risk groups (e.g., Arquetipos 1, 5, 9) are associated with intensive tutoring in filter subjects, program redesign in high-friction courses and structured bridge programs. Moderate-risk groups (e.g., Arquetipos 0, 2) call for mentoring, study-skills training and strengthened psychosocial and financial support. Low-risk archetypes are not intervention targets but rather strategic resources (mentors, role models, benchmarks).\u003c/p\u003e\u003cp\u003eThe matrix also provides a staged implementation roadmap, beginning with pilots for a single archetype and progressively expanding towards full integration of CAPIRE in academic advising and institutional planning.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec61\" class=\"Section3\"\u003e\u003ch2\u003e7.6.3. Alignment with Institutional Knowledge\u003c/h2\u003e\u003cp\u003eQualitative feedback from 8 academic advisors and 3 department heads confirmed strong alignment between archetypes and existing informal categories used in advising. Interestingly, staff tended to \u003cb\u003eoverestimate\u003c/b\u003e the importance of pre-entry factors (N2) and \u003cb\u003eunderestimate\u003c/b\u003e trajectory dynamics (N4), illustrating common attribution biases: human observers focus on stable traits and neglect temporal processes. CAPIRE thus serves not only as a prediction tool but also as a \u003cb\u003econceptual reframing device\u003c/b\u003e, making dynamic mechanisms visible to institutional actors.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec62\" class=\"Section2\"\u003e\u003ch2\u003e7.7. Discussion: Lessons from UNIVERSITY X Implementation\u003c/h2\u003e\u003cdiv id=\"Sec63\" class=\"Section3\"\u003e\u003ch2\u003e7.7.1. 
CAPIRE Framework Validation\u003c/h2\u003e\u003cp\u003eThe FACULTY B-UNIVERSITY X case demonstrates that CAPIRE can:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eenforce strict temporal validity (VOT) and eliminate data leakage;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ediscover a manageable set of \u003cb\u003einterpretable archetypes\u003c/b\u003e recognized by practitioners;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eachieve \u003cb\u003estatistically robust\u003c/b\u003e clustering (bootstrap and permutation tests);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003esupport \u003cb\u003ehighly accurate early prediction\u003c/b\u003e of archetype membership;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003etranslate predictions into a differentiated intervention portfolio;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eintegrate multilevel features (N1\u0026ndash;N4) in a single explanatory framework.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec64\" class=\"Section3\"\u003e\u003ch2\u003e7.7.2. Methodological Innovations Confirmed\u003c/h2\u003e\u003cp\u003eThree methodological choices are particularly reinforced:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eUMAP\u0026thinsp;+\u0026thinsp;DBSCAN vs. 
Mapper for archetype discovery\u003c/b\u003e: While Mapper TDA is valuable for exploratory topology, the combination of UMAP and density-based clustering proved better suited for obtaining a small number of robust, institutionally actionable archetypes.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMultilevel feature engineering and interactions\u003c/b\u003e: Cross-level interaction terms (N3\u0026times;N4, N2\u0026times;N4) contributed disproportionately to predictive performance, empirically supporting the CAPIRE view that outcomes emerge from interactions across levels rather than from isolated variables.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eVOT-based leakage control\u003c/b\u003e: Setting \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}=1.5\\)\u003c/span\u003e\u003c/span\u003e years struck a practical balance: the classifier achieved almost 95% accuracy while preserving a lead time of roughly 1.3 years before the average dropout time (2.8 years), making early intervention realistically feasible.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec65\" class=\"Section3\"\u003e\u003ch2\u003e7.7.3. Limitations and Threats to Validity\u003c/h2\u003e\u003cp\u003eThe study faces several limitations:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eInternal validity.\u003c/b\u003e Students who drop out before \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e cannot be fully observed; although sensitivity analyses with shorter VOT windows yield similar archetypes, some selection bias may remain. 
Self-reported data (e.g., work status) may underestimate informal employment.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eExternal validity.\u003c/b\u003e Results come from a single public engineering school in Country Q. Archetype structure might differ in private universities, non-STEM programs or other national systems, especially in more stable macro-economic contexts.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eConstruct validity.\u003c/b\u003e Archetype labels (\u0026ldquo;high-risk\u0026rdquo;, \u0026ldquo;success model\u0026rdquo;) are heuristic and probabilistic; boundaries are fuzzy. The interpretation of curricular friction assumes that dropped courses mark structural barriers, although strategic withdrawals may also occur.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eStatistical conclusion validity.\u003c/b\u003e Multiple comparisons across features and archetypes increase the risk of false positives; however, the main conclusions rely on effect sizes, stability metrics and permutation tests rather than isolated p-values. Students labelled as DBSCAN noise represent a non-negligible minority whose trajectories require more refined modelling.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec66\" class=\"Section3\"\u003e\u003ch2\u003e7.7.4. 
Comparison with Prior Attrition Models\u003c/h2\u003e\u003cp\u003eCompared with traditional regression-based approaches and more recent deep-learning models, the CAPIRE implementation at FACULTY B-UNIVERSITY X offers a distinct combination of properties:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eit relies solely on administrative data (no surveys), increasing scalability;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eit reaches higher or comparable predictive accuracy while maintaining \u003cb\u003eexplainability\u003c/b\u003e;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eit enforces temporal validity, an often neglected aspect in education data mining;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eand it links predictions to \u003cb\u003eexplicit archetypes and intervention strategies\u003c/b\u003e, closing the loop between analytics and policy.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIn this sense, CAPIRE sits between theory-heavy but operationally vague models (e.g., Tinto\u0026rsquo;s integration framework) and highly predictive but opaque \u0026ldquo;black-box\u0026rdquo; models, providing a middle path of \u003cb\u003emechanistic, actionable explainability\u003c/b\u003e.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec67\" class=\"Section3\"\u003e\u003ch2\u003e7.7.5. Practical Implications\u003c/h2\u003e\u003cp\u003eFor university leadership, the results underscore the importance of investing in longitudinal data infrastructure and in differentiated support strategies aligned with archetype profiles. For researchers, they highlight the need to integrate multilevel feature engineering, topological tools and strict temporal validation. 
For policymakers, the findings emphasize that attrition is structurally heterogeneous and that \u003cb\u003esegment-specific interventions\u003c/b\u003e are more efficient than uniform policies.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec68\" class=\"Section2\"\u003e\u003ch2\u003e7.8. Conclusion of the Empirical Illustration\u003c/h2\u003e\u003cp\u003eThe FACULTY B-UNIVERSITY X case study shows that CAPIRE can transform conventional administrative data into a \u003cb\u003ecoherent, multilevel map of student trajectories\u003c/b\u003e. The 13 archetypes identified capture 63.1% of students, remain stable across cohorts, are statistically robust and are recognised by institutional stakeholders. A leakage-aware classifier can assign students to archetypes with high accuracy at 1.5 years, providing a generous window for targeted intervention.\u003c/p\u003e\u003cp\u003eThis empirical illustration validates CAPIRE not only as a conceptual framework but as an operational blueprint for data-driven retention policies. The next section (Section \u003cspan refid=\"Sec69\" class=\"InternalRef\"\u003e8\u003c/span\u003e) situates these findings within broader educational theory and discusses how the CAPIRE approach can be generalized and scaled to other institutional and national contexts.\u003c/p\u003e\u003c/div\u003e"},{"header":"8. DISCUSSION","content":"\u003cp\u003eThe empirical validation at FACULTY B-UNIVERSITY X (Section \u003cspan refid=\"Sec36\" class=\"InternalRef\"\u003e7\u003c/span\u003e) shows that CAPIRE fulfils its foundational goals: leakage-free feature engineering, interpretable trajectory archetypes, and accurate early-warning predictions. In this section, we synthesize the main theoretical contributions, position CAPIRE vis-\u0026agrave;-vis alternative approaches, and discuss implications for institutional practice, portability, and ethics.\u003c/p\u003e\u003cdiv id=\"Sec70\" class=\"Section2\"\u003e\u003ch2\u003e8.1. 
Multilevel Feature Engineering: Theoretical and Empirical Validation\u003c/h2\u003e\u003cdiv id=\"Sec71\" class=\"Section3\"\u003e\u003ch2\u003e8.1.1. Interaction Effects as Primary Drivers\u003c/h2\u003e\u003cp\u003eA central claim of CAPIRE is that educational outcomes emerge from \u003cb\u003ecross-level interactions\u003c/b\u003e, not from isolated main effects. The FACULTY B-UNIVERSITY X results support this claim: interaction features represent a minority of the feature set yet account for a disproportionate share of predictive importance.\u003c/p\u003e\u003cp\u003eInteractions such as \u003cem\u003ecurricular friction \u0026times; dropout behaviour\u003c/em\u003e (e.g., IFC \u0026times; proportion of \u0026ldquo;libre\u0026rdquo; courses) and \u003cem\u003eage at entry \u0026times; number of retries\u003c/em\u003e encode person\u0026ndash;context fit: the same institutional conditions (e.g., high-friction courses) have different consequences for older students with family or work responsibilities than for younger students with fewer constraints.\u003c/p\u003e\u003cp\u003eThis pattern aligns with life course theory (Elder, 1998) and ecological systems theory (Bronfenbrenner, 1979), both of which emphasize that development reflects the alignment between individual characteristics and layered contextual demands.\u003c/p\u003e\u003cp\u003eMethodologically, this has two consequences:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eModels that ignore interactions (e.g., simple logistic regression without interaction terms) are structurally underpowered.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003ePre-computing theoretically motivated interactions, rather than relying solely on tree-based models to discover them implicitly, improves interpretability: features such as \u003cem\u003eage \u0026times; retries\u003c/em\u003e have a clear narrative interpretation that advisors can 
understand and use.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec72\" class=\"Section3\"\u003e\u003ch2\u003e8.1.2. Trajectory Dynamics Rival Snapshot Performance\u003c/h2\u003e\u003cp\u003eTraditional early-warning systems often rely on static indicators such as GPA at a particular semester. CAPIRE adds \u003cb\u003eN4 trajectory features\u003c/b\u003e that capture how students move through the curriculum: gaps, re-enrolments, entropy of states, and velocity of progress.\u003c/p\u003e\u003cp\u003eEmpirically, N4 features contribute nearly as much predictive power as N3 performance snapshots. Two students with similar GPA at \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e can belong to very different archetypes: one with linear, gap-free progress and another with repeated enrolments, mixed outcomes and long interruptions. The latter is far more likely to drop out, even if grades at a given point are comparable.\u003c/p\u003e\u003cp\u003eThis supports longitudinal perspectives (Singer \u0026amp; Willett, 2003) and shows that \u003cb\u003epatterns over time\u003c/b\u003e contain crucial information beyond static performance. For practice, it implies that advisors should pay attention to \u003cem\u003ehow\u003c/em\u003e students progress, not just to \u003cem\u003ewhat\u003c/em\u003e their current grades are.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec73\" class=\"Section3\"\u003e\u003ch2\u003e8.1.3. Socioeconomic Context Operates Indirectly\u003c/h2\u003e\u003cp\u003eDespite the strong literature on socioeconomic barriers to persistence (Bourdieu, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e1986\u003c/span\u003e; Lareau, 2011), N1 structural features have low direct importance in the predictive model. 
This does not refute socioeconomic theories; instead, it suggests an \u003cb\u003eindirect, mediated role\u003c/b\u003e.\u003c/p\u003e\u003cp\u003eHigh neighborhood deprivation (NBI) is associated with a higher probability of working while studying, which in turn is associated with a higher proportion of dropped courses and greater gaps in enrolment. These downstream N2\u0026ndash;N4 variables, not N1 alone, are what directly drive attrition in the model.\u003c/p\u003e\u003cp\u003eIn causal terms, N1 functions as a distal determinant, shaping exposure to risk mechanisms further down the trajectory. Removing N1 from the feature set reduces overall performance, but its contribution is mostly channeled through mediating features rather than appearing as a top-ranked predictor on its own.\u003c/p\u003e\u003cp\u003eFor policy, this reinforces the idea that \u003cb\u003estructural interventions\u003c/b\u003e (e.g., financial support that reduces the need to work long hours) are complementary to academic interventions: they operate upstream in the causal chain.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec74\" class=\"Section2\"\u003e\u003ch2\u003e8.2. Advantages Over Black-Box and Theory-Free Approaches\u003c/h2\u003e\u003cdiv id=\"Sec75\" class=\"Section3\"\u003e\u003ch2\u003e8.2.1. Interpretability, Trust, and Institutional Adoption\u003c/h2\u003e\u003cp\u003eCompared with black-box models such as deep neural networks (Hu \u0026amp; Rangwala, 2020), CAPIRE offers a combination of \u003cb\u003ehigh predictive accuracy and high interpretability\u003c/b\u003e. Feature importance analyses identify a small set of conceptually clear variables and interactions that explain most of the model\u0026rsquo;s performance. 
Archetypes themselves provide a human-readable typology of student trajectories.\u003c/p\u003e\u003cp\u003eQualitative feedback from advisors at FACULTY B-UNIVERSITY X confirms that archetypes match their tacit categories (e.g., \u0026ldquo;chronic repeaters\u0026rdquo;, \u0026ldquo;good but overwhelmed students\u0026rdquo;, \u0026ldquo;exemplary trajectories\u0026rdquo;), which increases trust and willingness to use the system. This contrasts with previous pilots using opaque models, which advisors found difficult to interpret and, consequently, to act upon.\u003c/p\u003e\u003cp\u003eInterpretability is not a cosmetic advantage. In high-stakes settings such as academic progression, institutional actors must be able to \u003cb\u003eexplain and justify\u003c/b\u003e decisions. Archetypes and their defining features offer precisely that: a language that bridges statistical output and pedagogical action.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec76\" class=\"Section3\"\u003e\u003ch2\u003e8.2.2. Theory-Driven Feature Engineering vs. Purely Data-Driven Selection\u003c/h2\u003e\u003cp\u003eMany educational data mining (EDM) studies start from hundreds of candidate variables and rely on automated selection. CAPIRE follows the opposite path: it starts from a constrained, theory-driven feature dictionary anchored in multilevel models of student persistence.\u003c/p\u003e\u003cp\u003eThe FACULTY B-UNIVERSITY X results show that a relatively compact, theoretically guided set of 44 features can match or surpass the performance of broader, theory-free feature sets reported in the literature. 
This has three advantages:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTransferability\u003c/b\u003e: Features defined in terms of concepts like structural vulnerability, friction, and trajectory dynamics can be re-instantiated across institutions and countries, whereas highly specific behavioural traces (e.g., click patterns in a particular learning platform) are often not portable.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eStability\u003c/b\u003e: A theory-driven dictionary changes slowly; in contrast, data-driven feature sets can fluctuate from cohort to cohort, creating confusion and undermining institutional memory.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eProtection against spurious correlations\u003c/b\u003e: By constraining the design space to theoretically plausible mechanisms, CAPIRE reduces the risk of learning artefacts that are predictive in one context but meaningless or unfair in another.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eThis does not mean that exploratory, data-driven discovery is useless. Rather, for \u003cb\u003eoperational early-warning systems\u003c/b\u003e, theory-driven feature engineering offers a more stable and ethically defensible foundation.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec77\" class=\"Section2\"\u003e\u003ch2\u003e8.3. Implications for Early-Warning Systems and Targeted Interventions\u003c/h2\u003e\u003cdiv id=\"Sec78\" class=\"Section3\"\u003e\u003ch2\u003e8.3.1. 
Lead Time and Proactive Support\u003c/h2\u003e\u003cp\u003eSetting \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}=1.5\\)\u003c/span\u003e\u003c/span\u003e years provides a \u003cb\u003esubstantial lead time\u003c/b\u003e between reliable risk identification and the typical dropout event (around 2.8 years after enrolment in our sample). This means that the system flags students when there is still a realistic window to implement meaningful support.\u003c/p\u003e\u003cp\u003eThis contrasts with reactive approaches that trigger alerts only after repeated failure or near-irreversible disengagement. By incorporating trajectory dynamics and friction metrics early, CAPIRE allows institutions to \u003cb\u003emove from \u0026ldquo;late diagnosis\u0026rdquo; to proactive care\u003c/b\u003e.\u003c/p\u003e\u003cp\u003eThe sensitivity analysis using alternative VOTs suggests that \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}=1.5\\)\u003c/span\u003e\u003c/span\u003e years offers a good compromise: signals are strong enough for high predictive accuracy, while the intervention window remains sufficiently wide.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec79\" class=\"Section3\"\u003e\u003ch2\u003e8.3.2. Archetype-Based Interventions Rather Than One-Size-Fits-All\u003c/h2\u003e\u003cp\u003eTraditional risk scores compress heterogeneous trajectories into a single number, often routing all \u0026ldquo;high-risk\u0026rdquo; students into a generic intervention. 
CAPIRE, by contrast, distinguishes \u003cb\u003equalitatively different risk profiles\u003c/b\u003e:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eHigh-friction archetypes (e.g., Arquetipo 5) require intensive academic support in filter courses.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eExtra-academic risk archetypes (e.g., Arquetipo 2) call for counseling, social support, and flexible policies.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTotal disengagement archetypes (e.g., Arquetipo 9) point to the need for strengthened onboarding and bridge programs.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eTreating these groups as equivalent would blur specific needs and dilute the impact of interventions. Archetype-based design enables \u003cb\u003edifferentiated, targeted strategies\u003c/b\u003e, and it also clarifies which combinations of mechanisms are being addressed (e.g., friction, economic stress, trajectory instability).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec80\" class=\"Section3\"\u003e\u003ch2\u003e8.3.3. Understanding the DBSCAN Outlier Group\u003c/h2\u003e\u003cp\u003eThe analysis of students labelled as noise by DBSCAN reveals that they form a \u003cb\u003ecoherent minority pattern\u003c/b\u003e rather than random irregularities. Their trajectories tend to be continuous and stable, with small gaps and low variance in key indicators, even if they do not conform to the density structure of the main archetypes in the UMAP space.\u003c/p\u003e\u003cp\u003eSubsequent re-clustering identified at least two sharply separated micro-archetypes within this group. 
This suggests that density-based clustering, while effective for discovering dominant patterns, can leave minority but meaningful trajectories at the margins.\u003c/p\u003e\u003cp\u003eFor CAPIRE, the outlier group is thus best understood as a \u003cb\u003edocumented residual population\u003c/b\u003e whose structure motivates further methodological work (e.g., hybrid clustering strategies) rather than as noise to be ignored.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec81\" class=\"Section2\"\u003e\u003ch2\u003e8.4. Relationship with Causal Inference\u003c/h2\u003e\u003cp\u003eAlthough this article focuses on prediction and segmentation, CAPIRE is designed to facilitate \u003cb\u003ecausal inference\u003c/b\u003e in future studies. Two properties are particularly important:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTemporal validity through VOT.\u003c/b\u003e Because all features are constructed using information available at or before \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e, they are suitable for defining pre-treatment covariates in quasi-experimental designs. This is essential for methods such as propensity score matching, regression discontinuity, or difference-in-differences, where post-treatment information would invalidate identification assumptions.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eRich, multilevel covariate structure.\u003c/b\u003e The N1\u0026ndash;N4 dictionary provides a nuanced set of confounders and mediators relevant to treatment assignment (e.g., who receives tutoring, financial aid, or counseling) and to outcomes. 
This increases the plausibility of conditional ignorability assumptions in observational studies.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eIn practical terms, CAPIRE can serve as the \u003cb\u003edata backbone\u003c/b\u003e for evaluating the impact of specific institutional policies: once archetype-based interventions are implemented, researchers can exploit the existing feature infrastructure to design rigorous causal evaluations of those interventions.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec82\" class=\"Section2\"\u003e\u003ch2\u003e8.5. Portability and Generalization\u003c/h2\u003e\u003cp\u003eCAPIRE\u0026rsquo;s \u003cb\u003emultilevel taxonomy\u003c/b\u003e is conceptually general: structural context (N1), entry moment (N2), performance snapshots (N3) and trajectory dynamics (N4) are relevant in community colleges, research universities, online programs and graduate schools, although their operationalization will differ.\u003c/p\u003e\u003cp\u003eAdapting CAPIRE to new contexts primarily involves:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003emapping local structural indicators (e.g., census measures, financial aid schemes) to N1;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eencoding program-specific entry features (e.g., admission pathways, prior certifications) in N2;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ere-computing friction metrics (IFC) for the relevant set of courses in N3;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003epreserving the generic logic of gaps, entropy and velocity in N4.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eWe expect N1\u0026ndash;N2 features to vary considerably across systems, while N3\u0026ndash;N4 patterns (friction, progression, instability) will be more stable. 
Archetype counts and specific profiles will likely change, but the general finding that \u003cb\u003einteraction effects and trajectory dynamics matter\u003c/b\u003e should remain robust.\u003c/p\u003e\u003cp\u003eNonetheless, the current study is based on a single public engineering institution in Country Q. Replication in private, non-STEM and international settings is necessary to fully assess external validity.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec83\" class=\"Section2\"\u003e\u003ch2\u003e8.6. Ethical Considerations and Potential Harms\u003c/h2\u003e\u003cdiv id=\"Sec84\" class=\"Section3\"\u003e\u003ch2\u003e8.6.1. Fairness and Bias\u003c/h2\u003e\u003cp\u003ePredictive systems can inadvertently encode and reproduce historical inequities. CAPIRE addresses this risk in several ways:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eIt avoids direct use of sensitive attributes such as race or religion; N1 structural indicators are area-level rather than individual-level.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFairness audits (not detailed here) suggest that archetype assignment and predictive errors do not differ substantially by gender or rural/urban origin.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe system is explicitly \u003cb\u003ehuman-in-the-loop\u003c/b\u003e: archetype labels are recommendations for advisors, not automatic decisions.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eResidual risks remain: structural variables may correlate with unobserved forms of discrimination, and targeted interventions could unintentionally overlook disadvantaged students who do not fit N1 criteria. Institutions using CAPIRE should therefore perform regular fairness audits and adjust policies if systematic disparities appear.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec85\" class=\"Section3\"\u003e\u003ch2\u003e8.6.2. 
Stigmatization and Labelling\u003c/h2\u003e\u003cp\u003eAssigning students to \u0026ldquo;high-risk\u0026rdquo; archetypes carries the danger of stigmatization and self-fulfilling prophecies. CAPIRE mitigates this by:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003erestricting archetype labels to internal use (students are not told their archetype);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ere-estimating archetypes periodically so that labels can change as trajectories change;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eframing interventions in terms of support and opportunity rather than deficit (\u0026ldquo;We see you are facing challenges in math; here is a support program\u0026rdquo;), and by also recognizing resilience indicators in high-risk archetypes.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe ethical stance is that analytics should \u003cb\u003eexpand\u003c/b\u003e students\u0026rsquo; options, not constrain them.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec86\" class=\"Section3\"\u003e\u003ch2\u003e8.6.3. Resource Allocation\u003c/h2\u003e\u003cp\u003eArchetype-based targeting inevitably shapes how institutional resources are distributed. While this can increase effectiveness, it also raises questions about opportunity costs and the treatment of moderate-risk students.\u003c/p\u003e\u003cp\u003eCAPIRE is not a replacement for universal support systems; it is a mechanism for prioritising additional, specialized interventions. Institutions must monitor whether certain groups are systematically excluded from support and ensure that targeting does not become a justification for reducing baseline services.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec87\" class=\"Section2\"\u003e\u003ch2\u003e8.7. 
Limitations of the CAPIRE Framework\u003c/h2\u003e\u003cp\u003eSeveral limitations qualify the findings and suggest directions for improvement:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eData and infrastructure requirements.\u003c/b\u003e CAPIRE assumes reasonably complete, longitudinal administrative data and the capacity to link external sources. Under-resourced institutions may need simplified variants (e.g., omitting N1) or staged implementations.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDynamic environments.\u003c/b\u003e Archetypes are estimated on historical data and may drift as curricula, policies or student populations change. Periodic re-estimation and monitoring of archetype distributions are necessary to detect and adapt to such shifts.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eCorrelation vs. causation.\u003c/b\u003e The present study is predictive and descriptive. While it highlights plausible mechanisms (e.g., friction, work-study balance, temporal instability), it does not by itself establish causal effects. Interventions inspired by CAPIRE should be rigorously evaluated, ideally with quasi-experimental or experimental designs.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eDespite these limitations, the FACULTY B-UNIVERSITY X implementation suggests that CAPIRE provides a \u003cb\u003ecoherent, leakage-aware and operationally usable\u003c/b\u003e framework for understanding and acting upon student attrition. It offers a middle ground between purely theoretical models and purely predictive black boxes, and it lays the groundwork for future causal and comparative research.\u003c/p\u003e\u003c/div\u003e"},{"header":"9. CONCLUSION AND FUTURE WORK","content":"\u003cdiv id=\"Sec89\" class=\"Section2\"\u003e\u003ch2\u003e9.1. 
Summary of Contributions\u003c/h2\u003e\u003cp\u003eThis paper introduced CAPIRE (Comprehensive Analytics Platform for Institutional Retention Engineering), a multilevel, leakage-aware framework for student attrition modeling. We operationalized CAPIRE through an empirical study at Universidad Nacional de Region Z, Facultad de Ciencias Exactas y Tecnolog\u0026iacute;a (FACULTY B-UNIVERSITY X), analyzing 1,343 engineering students across 15 cohorts (2004\u0026ndash;2019). The main contributions are:\u003c/p\u003e\u003cp\u003e\u003cb\u003eC1: Multilevel Feature Taxonomy (N1\u0026ndash;N4)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe proposed a theoretically grounded feature dictionary with 44 variables organized into four levels:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN1 \u0026ndash; Pre-entry structural context\u003c/b\u003e: neighborhood deprivation, proxies of family capital, local labor-market indicators.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN2 \u0026ndash; Entry moment\u003c/b\u003e: age at enrolment, employment status, macro-economic context at \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{t}_{0}\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN3 \u0026ndash; Academic performance and curricular friction\u003c/b\u003e: grades, course failures, drop (\u0026ldquo;libre\u0026rdquo;) patterns, and Instructional Friction Coefficients (IFC).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eN4 \u0026ndash; Trajectory dynamics\u003c/b\u003e: gaps between enrolments, state entropy, retries, and velocity of curricular advance.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eEmpirically, interaction terms (e.g., N3\u0026times;N4, N2\u0026times;N4) represent a minority of features but account for a substantial share of predictive importance, confirming that 
outcomes emerge from cross-level interplay rather than additive main effects. This gives an operational form to multilevel theories of persistence (Bronfenbrenner, 1979; Pascarella \u0026amp; Terenzini, 2005).\u003c/p\u003e\u003cp\u003e\u003cb\u003eC2: Value of Observation Time (VOT) and Leakage Prevention\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe formalized \u003cb\u003eVOT\u003c/b\u003e as a temporal boundary for feature construction, enforcing the use of only pre-\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}\\)\u003c/span\u003e\u003c/span\u003e information. In the FACULTY B-UNIVERSITY X case, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{V}=1.5\\)\u003c/span\u003e\u003c/span\u003e years (end of the second academic year) provides:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ea strict barrier against future-information leakage;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ea 1.3-year lead time before the average dropout event (2.8 years after enrolment);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ea reproducible configuration regime (versioned YAML configurations, code-level checks on cutoff dates).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis directly addresses the pervasive leakage problem in educational data mining, where performance is often overestimated by incorporating post-outcome data into features.\u003c/p\u003e\u003cp\u003e\u003cb\u003eC3: Empirical Validation via Trajectory Archetypes\u003c/b\u003e\u003c/p\u003e\u003cp\u003eUsing UMAP for dimensionality reduction and DBSCAN for density-based clustering on VOT-compliant features, we identified \u003cb\u003e13 trajectory archetypes\u003c/b\u003e that cover 63.1% of the student population. 
These archetypes are:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eStatistically robust\u003c/b\u003e: bootstrap stability (mean ARI\u0026thinsp;=\u0026thinsp;0.614), permutation tests (p\u0026thinsp;\u0026lt;\u0026thinsp;0.01), and robustness to hyperparameter changes.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTemporally stable\u003c/b\u003e: cross-cohort comparison (2004\u0026ndash;2010 vs. 2011\u0026ndash;2019) shows attrition-rate differences under 5 percentage points for major archetypes.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003ePredictively usable\u003c/b\u003e: a Random Forest classifier achieves 94.9% test accuracy in archetype assignment, with all archetypes reaching F1\u0026thinsp;\u0026ge;\u0026thinsp;0.70 and several high-risk types exhibiting near-perfect classification.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eQualitative validation with academic advisors shows that archetypes align with existing practitioner categories (e.g., \u0026ldquo;chronic repeaters\u0026rdquo;, \u0026ldquo;good but overwhelmed students\u0026rdquo;), bridging statistical structure and institutional knowledge.\u003c/p\u003e\u003cp\u003e\u003cb\u003eC4: Actionable Intervention Matrix\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe translated archetypes into differentiated intervention recommendations (e.g., intensive tutoring for friction-driven archetypes, counseling and support for extra-academic risk archetypes, enhanced onboarding for disengagement profiles). Rather than a single \u0026ldquo;high-risk\u0026rdquo; group, CAPIRE provides a matrix of \u003cb\u003erisk mechanisms \u0026times; intervention types\u003c/b\u003e, allowing institutions to design targeted, mechanism-aware responses instead of one-size-fits-all programs.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec90\" class=\"Section2\"\u003e\u003ch2\u003e9.2. 
Methodological and Theoretical Advances\u003c/h2\u003e\u003cdiv id=\"Sec91\" class=\"Section3\"\u003e\u003ch2\u003e9.2.1. Resolving the Interpretability\u0026ndash;Accuracy Trade-off\u003c/h2\u003e\u003cp\u003eCAPIRE shows that high predictive performance does not require black-box models. By combining:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ea \u003cb\u003etheory-driven feature dictionary\u003c/b\u003e (N1\u0026ndash;N4, including key interactions);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ea \u003cb\u003eleakage-aware temporal design\u003c/b\u003e (VOT); and\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ea \u003cb\u003etransparent classifier\u003c/b\u003e (Random Forest with feature importance and archetype profiles),\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003ewe obtain accuracy comparable to or exceeding deep learning approaches reported in the literature, while retaining clear interpretability. The usual trade-off between \u0026ldquo;explainable but weak\u0026rdquo; and \u0026ldquo;powerful but opaque\u0026rdquo; is weakened: much of the gain comes from better features and temporal design, not from more complex algorithms.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec92\" class=\"Section3\"\u003e\u003ch2\u003e9.2.2. 
Archetypes as a Middle Ground Between Risk Scores and Case Narratives\u003c/h2\u003e\u003cp\u003eCAPIRE\u0026rsquo;s archetypes sit between individual case studies and generic risk scores:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ethey are \u003cb\u003equantitatively derived\u003c/b\u003e from high-dimensional data;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ethey remain \u003cb\u003equalitatively interpretable\u003c/b\u003e, with recognizable narratives (\u0026ldquo;young strivers\u0026rdquo;, \u0026ldquo;persistent friction\u0026rdquo;, \u0026ldquo;total disengagement\u0026rdquo;, \u0026ldquo;success models\u0026rdquo;);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ethey are \u003cb\u003escalable\u003c/b\u003e, as a trained classifier can assign students to archetypes in real time.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis reconciles person-centred and variable-centred traditions: institutions retain the richness of narrative categories while gaining the scalability and reproducibility of formal models.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec93\" class=\"Section2\"\u003e\u003ch2\u003e9.3. Practical Implications for Institutions\u003c/h2\u003e\u003cdiv id=\"Sec94\" class=\"Section3\"\u003e\u003ch2\u003e9.3.1. CAPIRE as Institutional Analytics Infrastructure\u003c/h2\u003e\u003cp\u003eCAPIRE should be understood as an \u003cb\u003eanalytics infrastructure\u003c/b\u003e, not as a one-off model. 
Its components are reusable:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eThe \u003cb\u003efeature dictionary\u003c/b\u003e can be adapted to other programs and institutions, preserving the N1\u0026ndash;N4 logic while changing local indicators.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe \u003cb\u003eVOT principle\u003c/b\u003e generalizes to other predictive tasks (course failure, time-to-degree, progression bottlenecks).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe \u003cb\u003epipeline architecture\u003c/b\u003e supports multiple downstream uses: archetype discovery, predictive modeling, and, in future work, causal evaluation of interventions.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eBecause the core is theory-based, it is more stable than ad hoc feature sets: institutions can refresh the data and periodically revisit parameter choices without rethinking the underlying conceptual structure.\u003c/p\u003e\u003cp\u003eOnce the implementation is further consolidated, we plan to release a reference implementation of the core pipeline in an open repository, so that other institutions can inspect, adapt, and extend the framework under transparent conditions.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec95\" class=\"Section3\"\u003e\u003ch2\u003e9.3.2. From Generic Risk to Differentiated Support\u003c/h2\u003e\u003cp\u003eFor institutional practice, the key shift is from \u003cb\u003egeneric \u0026ldquo;at-risk\u0026rdquo; flags\u003c/b\u003e to \u003cb\u003emechanism-specific profiles\u003c/b\u003e. 
CAPIRE encourages administrators and advisors to ask:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eIs this student at risk because of curricular friction, extra-academic stress, early disengagement, or some combination?\u003c/em\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cem\u003eWhat type of support aligns with that mechanism (tutoring, counseling, financial aid, bridge programs, mentoring)?\u003c/em\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis shift improves both the pedagogical quality and the ethical defensibility of early-warning systems, making it clearer why a student is flagged and what the institution intends to do about it.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec96\" class=\"Section2\"\u003e\u003ch2\u003e9.4. Future Research Directions\u003c/h2\u003e\u003cdiv id=\"Sec97\" class=\"Section3\"\u003e\u003ch2\u003e9.4.1. Cross-Institutional Validation\u003c/h2\u003e\u003cp\u003eThe main limitation of this study is its single-institution scope. 
Ongoing collaborations with universities in Latin America and North America will test CAPIRE in different contexts (public/private, STEM/non-STEM, different welfare regimes).\u003c/p\u003e\u003cp\u003eKey questions include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ewhether N3\u0026ndash;N4 dynamics (friction, gaps, entropy) generalize more strongly than N1\u0026ndash;N2 structures;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ehow many archetypes emerge in other contexts and how similar they are to the FACULTY B-UNIVERSITY X profiles;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ewhether the dominance of interaction terms in predictive importance is replicated across settings.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThese studies will clarify which components of CAPIRE are universal and which require strong local adaptation.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec98\" class=\"Section3\"\u003e\u003ch2\u003e9.4.2. Causal Inference and Policy Evaluation\u003c/h2\u003e\u003cp\u003eCAPIRE is presently descriptive and predictive; it does not identify causal effects. 
A natural next step is to exploit the VOT-compliant feature infrastructure in quasi-experimental or experimental designs, for example:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eregression discontinuity designs using institutional cut-offs for support programs;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003edifference-in-differences analyses comparing archetype-specific attrition before and after policy changes;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003erandomized or quasi-randomized trials of interventions targeted to specific archetypes.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis would move from \u0026ldquo;who is likely to drop out?\u0026rdquo; to \u0026ldquo;what actually works, for whom, and under what conditions?\u0026rdquo;, closing the loop between analytics and evidence-based policy.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec99\" class=\"Section3\"\u003e\u003ch2\u003e9.4.3. Expansion to Other Outcomes and Methodological Refinements\u003c/h2\u003e\u003cp\u003eFuture work can extend CAPIRE beyond binary attrition to:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003emulti-state progression trajectories (on-time, delayed, dropout, graduation);\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ecourse-level performance prediction for adaptive teaching;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003elinks between archetypes and post-graduation outcomes where data are available.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eOn the methodological side, several extensions are promising:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003emore systematic use of topological and multiscale clustering methods that preserve archetype interpretability while capturing overlapping or hierarchical structures;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ehybrid models that combine human-interpretable features with 
latent representations learned by dimensionality reduction or shallow neural architectures;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003efairness-aware learning schemes that explicitly constrain disparities in prediction quality across demographic or structural groups.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe central constraint for all these refinements is non-negotiable: temporal validity (VOT) and interpretability must remain at the core of any extension.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec100\" class=\"Section2\"\u003e\u003ch2\u003e9.5. Closing Reflection\u003c/h2\u003e\u003cp\u003eStudent attrition is not a \u003cb\u003etechnical problem\u003c/b\u003e to be \"solved\" by algorithms. It is a \u003cb\u003ehuman problem\u003c/b\u003e rooted in socioeconomic inequality, inadequate institutional support, and misalignment between students' needs and universities' structures. \u003cb\u003eCAPIRE does not solve attrition\u003c/b\u003e\u0026mdash;it provides \u003cb\u003einfrastructure\u003c/b\u003e for institutions to understand patterns, target resources, and evaluate policies. 
What it offers is a disciplined way of \u003cb\u003eseeing\u003c/b\u003e:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ethat trajectories are heterogeneous rather than homogeneous;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ethat risk mechanisms differ and must be addressed with different tools;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ethat early-warning systems, if temporally valid and interpretable, can support rather than replace human judgment.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe 13 archetypes at FACULTY B-UNIVERSITY X are not labels to stigmatize students but lenses to recognize heterogeneity, challenge one-size-fits-all policies, and design equitable interventions.\u003c/p\u003e\u003cp\u003eIf the framework helps retain some students who would otherwise have left\u0026mdash;not by blaming them, but by revealing structural frictions and unmet needs\u0026mdash;then the analytical effort will have been worthwhile. Algorithms cannot care; institutions and people can. A framework like CAPIRE is valuable only insofar as it amplifies that care, ensuring that patterns of struggle become visible early enough, and clearly enough, to act.\u003c/p\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eUse of AI-assisted tools\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors used a large language model (ChatGPT, OpenAI) only for language polishing, editorial refinement, and assistance in restructuring parts of the manuscript for clarity. All conceptual, methodological, analytical, and interpretative decisions were made exclusively by the authors. The LLM did not generate any primary data, analyses, results, or conclusions. 
All AI-assisted suggestions were manually reviewed and validated by the authors to ensure accuracy and consistency with the scientific content of the work.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSoftware and computational tools\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll data processing, feature engineering, statistical analyses, and topological modelling were conducted using open-source scientific computing tools, including Python (NumPy, Pandas, Scikit-learn), Ripser, KeplerMapper, and custom scripts developed by the authors. No proprietary analytic software was used in the production of the results reported in this article. All computations were executed on local hardware. Scripts used to generate the feature matrices and TDA-derived descriptors are available from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData privacy and ethics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe institutional datasets used in this study contain sensitive student information protected by university regulations and national privacy laws. For this reason, the raw data cannot be shared publicly. Aggregated indicators, feature definitions, and non-identifiable analytic structures are available from the corresponding author upon reasonable request and subject to institutional approval.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthorship transparency\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNo LLM, automated tool, or external assistant meets authorship criteria as defined by the journal. All authors take full responsibility for the integrity, originality, and accuracy of the work.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eA.A. conceptualized the study, designed the multilevel analytical framework, and developed the CAPIRE methodology. A.A. conducted the data preprocessing, feature engineering, topological data analysis, and archetype modeling. A.A. 
performed the statistical analyses, prepared all figures and tables, and validated the empirical results. A.A. wrote the manuscript, revised all sections for intellectual content, and approved the final version of the article. All contributions meet the journal\u0026rsquo;s authorship criteria.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eAvailability of data and materials: The datasets used in this study contain sensitive and identifiable student information protected by institutional regulations and national privacy laws. For this reason, the raw data cannot be made publicly available. Access to the datasets is restricted by the university\u0026rsquo;s data governance policies, which prohibit external sharing of student-level records. Aggregated, non-identifiable data descriptors, feature definitions, and analytical code can be made available from the corresponding author upon reasonable request and subject to institutional approval.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAdadi, A., \u0026amp; Berrada, M. (2018). Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). \u003cem\u003eIEEE Access, 6\u003c/em\u003e, 52138\u0026ndash;52160. http://dx.doi.org/10.1109/ACCESS.2018.2870052\u003c/li\u003e\n\u003cli\u003eAndrade-Gir\u0026oacute;n, D., Sandivar-Rosas, J., Mar\u0026iacute;n-Rodriguez, W., Susanibar-Ramirez, E., Toro-Dextre, E., Ausejo-Sanchez, J., Villarreal-Torres, H., \u0026amp; Angeles-Morales, J. (2023). Predicting student dropout based on machine learning and deep learning: A systematic review. \u003cem\u003eEAI Endorsed Transactions on Scalable Information Systems, 10\u003c/em\u003e(5), 1\u0026ndash;11. https://doi.org/10.4108/eetsis.3586\u003c/li\u003e\n\u003cli\u003eApicella, A., Isgr\u0026ograve;, F., Prevete, R., \u0026amp; Sansone, C. (2024). Don\u0026rsquo;t push the button! Exploring data leakage risks in machine learning applications. 
\u003cem\u003eArtificial Intelligence in Medicine, 154\u003c/em\u003e, 102826. https://doi.org/10.1016/j.artmed.2023.102826\u003c/li\u003e\n\u003cli\u003eBourdieu, P. (1986). The forms of capital. In J. G. Richardson (Ed.), \u003cem\u003eHandbook of theory and research for the sociology of education\u003c/em\u003e (pp. 241\u0026ndash;258). Greenwood.\u003c/li\u003e\n\u003cli\u003eCaprotti, O. (2017). Shapes of educational data in an online calculus course. \u003cem\u003eJournal of Learning Analytics, 4\u003c/em\u003e(2), 78\u0026ndash;92. https://doi.org/10.18608/jla.2017.42.5\u003c/li\u003e\n\u003cli\u003eCarlsson, G. (2009). Topology and data. \u003cem\u003eBulletin of the American Mathematical Society, 46\u003c/em\u003e(2), 255\u0026ndash;308. https://doi.org/10.1090/S0273-0979-09-01249-X\u003c/li\u003e\n\u003cli\u003eChazal, F., \u0026amp; Michel, B. (2021). An introduction to topological data analysis: Fundamental and practical aspects for data scientists. \u003cem\u003eFrontiers in Artificial Intelligence, 4\u003c/em\u003e, 667963. https://doi.org/10.3389/frai.2021.667963\u003c/li\u003e\n\u003cli\u003eDoran, D. (2018). Retention in higher education: An agent-based model of social interactions and motivated agent behavior. \u003cem\u003eJournal of Artificial Societies and Social Simulation, 21\u003c/em\u003e(3), 5. http://dx.doi.org/10.18564/jasss.3731\u003c/li\u003e\n\u003cli\u003eEster, M., Kriegel, H.-P., Sander, J., \u0026amp; Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In \u003cem\u003eProceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD\u0026rsquo;96)\u003c/em\u003e (pp. 226\u0026ndash;231). AAAI Press.\u003c/li\u003e\n\u003cli\u003eGanley, C. M., D\u0026rsquo;Agostino, J. V., \u0026amp; Rittle-Johnson, B. (2017). Shape of educational data: Interdisciplinary perspectives on quantitative educational data. 
\u003cem\u003eJournal of Learning Analytics, 4\u003c/em\u003e(2), 6\u0026ndash;11. https://doi.org/10.18608/jla.2017.42.1\u003c/li\u003e\n\u003cli\u003eHern\u0026aacute;n, M. A., \u0026amp; Robins, J. M. (2020). \u003cem\u003eCausal inference: What if\u003c/em\u003e. Chapman \u0026amp; Hall/CRC.\u003c/li\u003e\n\u003cli\u003eIBM. (n.d.). \u003cem\u003eWhat is data leakage in machine learning?\u003c/em\u003e IBM. Retrieved November 13, 2025, from https://www.ibm.com/topics/data-leakage-machine-learning\u003c/li\u003e\n\u003cli\u003eKelly, A. E. (2017). Is learning data in the right shape? Problems with the shape of educational data. \u003cem\u003eJournal of Learning Analytics, 4\u003c/em\u003e(2), 154\u0026ndash;159. https://doi.org/10.18608/jla.2017.42.9\u003c/li\u003e\n\u003cli\u003eKnight, S., Wise, A. F., \u0026amp; Chen, B. (2017). Time for change: Why learning analytics needs temporal analysis. \u003cem\u003eJournal of Learning Analytics, 4\u003c/em\u003e(3), 7\u0026ndash;17. https://doi.org/10.18608/jla.2017.43.2\u003c/li\u003e\n\u003cli\u003eKoukaras, P., \u0026amp; Tjortjis, C. (2025). Data preprocessing and feature engineering for data mining: Techniques, tools, and best practices. \u003cem\u003eAI, 6\u003c/em\u003e(10), 257. https://doi.org/10.3390/ai6100257\u003c/li\u003e\n\u003cli\u003eLundberg, S. M., \u0026amp; Lee, S.-I. (2017). A unified approach to interpreting model predictions. In \u003cem\u003eAdvances in Neural Information Processing Systems 30\u003c/em\u003e (pp. 4765\u0026ndash;4774). Curran Associates.\u003c/li\u003e\n\u003cli\u003eMcInnes, L., Healy, J., \u0026amp; Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. \u003cem\u003eJournal of Open Source Software, 3\u003c/em\u003e(29), 861. https://doi.org/10.21105/joss.00861\u003c/li\u003e\n\u003cli\u003eOrganisation for Economic Co-operation and Development. (2003). 
\u003cem\u003eStudent engagement at school: A sense of belonging and participation. Results from PISA 2000\u003c/em\u003e. OECD Publishing.\u003c/li\u003e\n\u003cli\u003ePearl, J., \u0026amp; Mackenzie, D. (2018). \u003cem\u003eThe book of why: The new science of cause and effect\u003c/em\u003e. Basic Books.\u003c/li\u003e\n\u003cli\u003ePerry, L. B., \u0026amp; McConney, A. (2010). Does the SES of the school matter? An examination of socioeconomic status and student achievement using PISA 2003. \u003cem\u003eInternational Journal of Science and Mathematics Education, 8\u003c/em\u003e(3), 437\u0026ndash;462. https://doi.org/10.1007/s10763-010-9197-0\u003c/li\u003e\n\u003cli\u003eSirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. \u003cem\u003eReview of Educational Research, 75\u003c/em\u003e(3), 417\u0026ndash;453. https://doi.org/10.3102/00346543075003417\u003c/li\u003e\n\u003cli\u003eSusnjak, T. (2022). Learning analytics dashboard: A tool for providing actionable feedback to students. \u003cem\u003eEducation and Information Technologies, 27\u003c/em\u003e, 1271\u0026ndash;1296. https://doi.org/10.1007/s10639-021-10635-8\u003c/li\u003e\n\u003cli\u003eTinto, V. (1993). \u003cem\u003eLeaving college: Rethinking the causes and cures of student attrition\u003c/em\u003e (2nd ed.). University of Chicago Press.\u003c/li\u003e\n\u003cli\u003eUNESCO. (2019). \u003cem\u003eGlobal education monitoring report 2019: Migration, displacement and education\u003c/em\u003e. UNESCO Publishing.\u003c/li\u003e\n\u003cli\u003eWilensky, U., \u0026amp; Rand, W. (2015). \u003cem\u003eAn introduction to agent-based modeling: Modeling natural, social, and engineered complex systems with NetLogo\u003c/em\u003e. 
MIT Press.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Feature Engineering, Learning Analytics, Student Retention, Data Leakage, Early-Warning Systems, Archetype Discovery, Value of Observation Time (VOT), Multilevel Modelling, Educational Data Mining","lastPublishedDoi":"10.21203/rs.3.rs-8118343/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8118343/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003ePredictive models for student dropout, while often accurate, frequently rely on opportunistic feature sets and suffer from undocumented data leakage, limiting their explanatory power and institutional usefulness. 
This paper introduces a leakage-aware data layer for student trajectory analytics, which serves as the methodological foundation for the CAPIRE framework for multilevel modelling.\u003c/p\u003e\u003cp\u003eWe propose a feature engineering design that organizes predictors into four levels: N1 (personal and socio-economic attributes), N2 (\u003cb\u003eentry moment and academic history\u003c/b\u003e), N3 (\u003cb\u003ecurricular friction and performance\u003c/b\u003e), and N4 (institutional and macro-context variables). As a core component, we formalize the \u003cb\u003eValue of Observation Time (VOT)\u003c/b\u003e as a critical design parameter that rigorously separates observation windows from outcome horizons, preventing data leakage by construction.\u003c/p\u003e\u003cp\u003eAn illustrative application in a long-cycle engineering program (1,343 students, ~\u0026thinsp;57% dropout) demonstrates that VOT-restricted multilevel features support robust archetype discovery. A UMAP\u0026thinsp;+\u0026thinsp;DBSCAN pipeline uncovers 13 trajectory archetypes, including profiles of "early structural crisis," "sustained friction," and "hidden vulnerability" (low friction but high dropout). Bootstrap and permutation tests confirm these archetypes are statistically robust and temporally stable.\u003c/p\u003e\u003cp\u003eWe argue that this approach transforms feature engineering from a technical step into a central methodological artifact. 
This data layer serves as a disciplined bridge between retention theory, early-warning systems, and the future implementation of causal inference and agent-based modelling (ABM) within the CAPIRE program.\u003c/p\u003e","manuscriptTitle":"A Leakage-Aware Data Layer For Student Analytics: The Capire Framework For Multilevel Trajectory Modeling","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-18 06:05:46","doi":"10.21203/rs.3.rs-8118343/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"907c41f4-a41a-4b3f-9114-df8d524be258","owner":[],"postedDate":"November 18th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-11-24T11:23:43+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-18 06:05:46","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8118343","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8118343","identity":"rs-8118343","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}