AI-Driven Intelligent Assessment System for Evaluating Transdisciplinary Competencies in Project-Based Learning: An Empirical and Simulation Study

Research Article (Research Square preprint)

Ayman Abu hajaj, Nadine Sheikh

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9401633/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

The assessment of transdisciplinary competencies, such as critical thinking, collaboration, and creativity, remains a major challenge in higher education, particularly within project-based learning (PBL) environments where learning is complex, dynamic, and context-dependent. Traditional assessment approaches are limited in capturing such competencies due to their reliance on static and isolated measurement methods.
This study proposes an AI-driven intelligent assessment system that integrates machine learning techniques with learning analytics to evaluate students' competencies using multi-source educational data, including academic performance, interaction logs, and collaborative activity indicators. The system employs ensemble models (Random Forest and Gradient Boosting) to generate predictive assessments, supported by explainability mechanisms (SHAP and LIME) to enhance interpretability and instructional usability. To ensure responsible and equitable assessment, the system incorporates embedded mechanisms for bias mitigation and fairness calibration during the prediction process. The proposed framework was evaluated using a dual-method approach: (1) an agent-based simulation involving 500 synthetic learners to examine system performance under controlled conditions, and (2) a pilot empirical study with 47 undergraduate students enrolled in interdisciplinary PBL courses. The results demonstrate significant improvements over a baseline AI model across five key evaluation metrics: accuracy (+22.7%), fairness (+46.7%), reliability (+28.6%), processing efficiency (+30.8%), and explainability (+65.5%). Empirical findings further confirm the system's effectiveness, with medium-to-large effect sizes (d = 0.68–0.91) across all measured dimensions. The study contributes to the field of educational technology by presenting a scalable and interpretable AI-based assessment system that enhances both the quality and fairness of evaluation in complex learning environments. The findings highlight the potential of integrating AI-driven analytics with responsible design principles to support data-informed educational decision-making.

Keywords: intelligent assessment, AI governance, transdisciplinary competencies, project-based learning, predictive modelling, algorithmic fairness, bias mitigation

1. Introduction

Assessment is one of the most consequential and least reformed practices in higher education. Universities have become far better at helping students develop complex, cross-disciplinary skills through project-based learning, collaborative inquiry, and authentic problem-solving tasks, but they have not become correspondingly better at measuring whether that development has taken place. The instruments on which we rely were crafted for a different purpose: they measure declarative knowledge efficiently and consistently, which is precisely what educational systems required when knowledge transmission was their main aim. That aim has shifted significantly, but the instruments have not (Voogt & Roblin, 2012; World Economic Forum, 2024).

This mismatch is not trivial in practice. Students who are genuinely strong on the dimensions that matter most (synthesising, drawing together ideas from multiple areas, working effectively under intense pressure, and generating original ideas) may fare poorly in assessments designed to check whether they have memorised the right material. The reverse also holds: students trained to perform well in examinations may appear more competent than their actual competency development justifies (Zawacki-Richter et al., 2019; Kovanović et al., 2021). Neither pattern is remedied by applying established tools more intensively; both are consequences of using instruments that are not fit for the task.

Artificial intelligence has attracted sustained interest as a potential way to address this challenge. AI-based systems can incorporate evidence from multiple sources, such as platform interaction logs, patterns of collaborative contribution, submissions, and communication logs, and update their assessments over time as new evidence accumulates (Luckin et al., 2016; Ifenthaler & Yau, 2020).
Where a traditional assessment takes a snapshot, an AI system can, in principle, follow a trajectory. This matters in PBL contexts, where the most educationally significant development occurs progressively and across a broad range of activities rather than at a single moment of assessment (Viberg et al., 2018).

But the risks are real and well documented. AI systems are trained on historical data, and historical education data mirrors existing inequalities: students from well-resourced backgrounds consistently compare favorably with their counterparts from disadvantaged backgrounds, owing to differences in preparation and in access to social capital, test familiarity, and the kinds of cultural knowledge that standardized assessments reward. An AI trained on such data, without mechanisms to detect or correct these patterns, will systematically reproduce them at scale and automatically, while appearing perfectly objective (Baker & Hawn, 2022; Dwivedi et al., 2023). This semblance of objectivity is arguably the most dangerous feature of biased AI, because it makes bias harder to spot and harder to challenge.

This danger has prompted both a scholarly literature and a policy dialogue on AI governance, an approach that seeks to ensure that intelligent systems are governed by principles of fairness, transparency, and accountability (Floridi et al., 2018; OECD, 2021; UNESCO, 2023). In educational assessment in particular, governance is not an optional enhancement; it is the condition of legitimate deployment. The literature has called for embedding governance into AI systems rather than imposing it on their final outputs (Dwivedi et al., 2023; Holmes et al., 2019). The argument is compelling. Yet implementation remains rudimentary.
A systematic review of 47 AI governance frameworks published between 2016 and 2024 shows a recurring pattern: governance and prediction are treated as separate concerns, with governance applied to the outputs of AI systems rather than integrated into the processes that produce those outputs. Systems that handle governance well often have technical limitations; systems that predict well tend to be ethically underdeveloped. No existing framework offers an operational case in which AI-based evaluation and embedded governance function as a single integrated architecture. This gap is the key challenge that drives the study.

The study addresses one specific research question: How can an AI-driven governance framework be designed to assess transdisciplinary competencies in project-based interdisciplinary learning environments while maintaining a meaningful balance between predictive accuracy and educational fairness?

The study makes three contributions. First, it lays out a functioning framework that joins AI-based prediction and governance within one operating architecture, and demonstrates that the integration is technically feasible. Second, it offers a way to analyse complex, evolving competencies that adapts continuously to the dynamics of PBL environments. Third, it provides empirical evidence, from both agent-based simulation and a pilot field study, of what a governance-integrated AI assessment system can accomplish, and under what conditions its performance advantages are most pronounced.

2. Theoretical Framework

2.1 Transdisciplinary Competencies in Contemporary Education

Over the last two decades, a consensus has grown across worldwide education policy frameworks, employer surveys, and curriculum research: to succeed in work and in life in the 21st century, students need capabilities that extend beyond their own disciplines.
In fact, critical thinking, collaboration and problem-solving skills, creativity, and the capacity to bring together knowledge from different fields are now central to more or less all forms of higher education. They are no longer an afterthought, nor an aspiration reserved for students in developed systems. This trend reflects real change in the kinds of problems graduates face and the kinds of workplaces they enter.

Interdisciplinary project-based learning has emerged as one of the most sought-after pedagogical approaches for developing these competencies as students engage constructively with the world around them. PBL places students in situations that require sustained engagement with intricate, real-world problems that cannot be addressed with the knowledge of any one discipline (Kokotsaki et al., 2016; Perignat & Katz-Buonincontro, 2019). The evidence that well-designed PBL environments foster transdisciplinary competency development is fairly robust. The evidence that our assessment tools can reliably measure this development is far weaker. Competencies of this kind grow non-linearly, contextually, and over time, properties that defy the design assumptions of traditional measurement instruments (Zawacki-Richter et al., 2019; Kovanović et al., 2021).

2.2 AI-Powered Assessment: Limitations and Advantages

In theory, learning analytics and machine learning have significantly broadened what assessment systems are capable of. By combining data from digital platform interaction streams, collaborative activity logs, submission histories, and communication patterns, AI-based systems can construct a fuller picture of learner development over multiple time frames than any single instructor can provide across a whole cohort (Luckin et al., 2016; Gašević et al., 2015).
For PBL environments in which substantial learning activity occurs outside formal submissions, such multi-source integration is especially useful (Viberg et al., 2018).

But the limitations are well established. Many successful AI models operate as black boxes: they generate predictions but keep the basis for them hidden from view (Baker & Hawn, 2022). When such predictions decide whether students will move forward or be held back, this opacity is not a technical nuisance; it is an accountability failure. And because AI systems are trained on historical educational data that already contains demographic and socioeconomic disparities, they risk systematically disadvantaging the very groups that traditional assessment penalizes, without anyone in the system being aware that this is happening.

2.3 AI Governance in Education

The term 'AI governance' refers to the concepts, structures, and operating tools that hold intelligent systems to standards of fairness, transparency, accountability, and privacy (Floridi et al., 2018; OECD, 2021). In educational assessment, governance is functionally compulsory, not an add-on. The literature has started to move away from authoritarian or hierarchical models of governance toward an emphasis on governance-by-design in the development and operation of AI systems: fairness, transparency, and accountability should be built in from the outset. The practical difference is substantial: a governance-by-design system identifies bias on the fly and corrects it as predictions are made, while a governance-by-audit system can only flag problems already present in the system's outputs, after students have been affected.

2.4 Fairness in Algorithms: Theoretical Foundations

The fairness obligations of the governance layer are guided by foundational scholarship on algorithmic fairness.
Barocas, Hardt, and Narayanan (2019) introduced the canonical taxonomy of fairness criteria (anti-classification, calibration, and error-rate parity) and showed that these criteria are not automatically satisfied together. Chouldechova (2017) illustrated this incompatibility empirically, finding that equalized false-positive rates and calibration cannot both hold when group base rates differ. Dwork et al. (2012) proposed the individual fairness criterion, under which similar people receive similar predictions; it supplements group-level measures by attending to within-group variation that can be lost in aggregate measures.

The framework operationalises two complementary criteria: equalized odds (Hardt et al., 2016) at the group level and individual fairness (Dwork et al., 2012) at the learner level. Because these criteria can produce conflicting adjustment signals, a reconciliation mechanism adapts the relative weight of the adjustments to the context in which the assessment takes place, weighting group fairness more heavily in high-stakes summative contexts and individual fairness more heavily in formative feedback contexts. The full reconciliation procedure is described in Section 3.3.1.

2.5 Existing Frameworks: A Comparative Perspective

Table 1 provides a systematic comparison of fourteen representative frameworks chosen from a review of 47 governance frameworks. The frameworks were selected to represent the range of implementation modes across the four dimensions most pertinent to the present study: whether governance is integrated or post hoc; whether the framework addresses the PBL context in depth; whether it offers a practical implementation; and whether governance addresses both group-level and individual fairness.
Table 1. Comparative Analysis of Representative Existing Frameworks

Framework / Study | Domain Focus | Governance Type | PBL Context | Operational? | Fairness Level | Interpretability
Holmes et al. (2019) | AI in Education | Post-hoc / principled | No | No | Group only | Low
Zawacki-Richter et al. (2019) | Higher ed AI review | Post-hoc / principled | No | No | Not addressed | Low
Floridi et al. (2018) | General AI ethics | Post-hoc / principled | No | No | Group only | Low
OECD (2021) | Policy governance | External regulation | No | No | Group only | Low
UNESCO (2023) | Education AI governance | Post-hoc / principled | No | No | Group only | Medium
Luckin et al. (2016) | Intelligent tutoring | Not addressed | Partial | Partial | Not addressed | Medium
Ifenthaler & Yau (2020) | Learning analytics | Not addressed | No | Partial | Not addressed | Medium
Baker & Hawn (2022) | Bias in ed. AI | Post-hoc | No | No | Group only | Low
Barocas et al. (2019) | Algorithmic fairness | Embedded (theory) | No | No | Group + Individual | N/A
Hardt et al. (2016) | Fairness (ML) | Embedded (theory) | No | No | Group only | N/A
Dwork et al. (2012) | Individual fairness | Embedded (theory) | No | No | Individual only | N/A
Kamiran & Calders (2012) | Bias mitigation (ML) | Embedded (technical) | No | Yes | Group only | Low
Proposed Framework | PBL competency assessment | Embedded (operational) | Yes | Yes | Group + Individual | High

Note. PBL = Project-Based Learning. 'Operational' indicates whether the framework provides a working implementation rather than principles or recommendations only. 'Fairness Level' indicates whether individual-level, group-level, or both fairness criteria are addressed.

The comparison reveals three patterns. First, the dominant mode of governance in present frameworks is post-hoc and principled: frameworks define standards for AI systems to meet rather than embedding mechanisms that help guarantee those standards are met during operation. Holmes et al. (2019) and Zawacki-Richter et al.
(2019), the most cited works at the intersection of AI and education, both exemplify this pattern. Second, no existing framework supports a governance implementation that addresses PBL environments specifically. Third, the technical literature on algorithmic fairness (Barocas et al., 2019; Hardt et al., 2016; Dwork et al., 2012) rests on a compelling theoretical foundation that has not been explored in educational contexts or in PBL specifically. The theoretical framework offered here fills these gaps.

2.6 Research Gap

The research landscape depicted in Table 1 is divided into silos that do not communicate. Robust AI models are ethically underdeveloped. Ethical guidelines are not operational. Educationally contextualised work does not address the particular task of assessing transdisciplinary competencies in PBL contexts. The proposed framework occupies the space where all three lines of work converge: it is technically demanding, ethically grounded, and educationally embedded.

3. The Proposed Predictive Governance Framework

3.1 Architecture Overview

The framework consists of three functionally distinct but operationally interdependent layers. Critically, information does not flow in one direction only. Governance-calibrated outputs feed back into the predictive models, and data collection priorities adapt to what the predictive and governance layers identify as most informative. This bidirectional flow, illustrated in Fig. 1, is central to the framework's adaptive capacity and distinguishes it from sequential pipeline designs.
3.1.1 The Data Layer

The Data Layer aggregates evidence from four source categories: academic performance records (grades, submission histories, prior trajectories), digital platform interaction logs (time-on-task, resource access, revision patterns), behavioural indicators (engagement frequency, responsiveness to feedback), and collaborative activity data (task distribution, contribution balance, peer communication). The rationale for multi-source integration is not merely comprehensiveness; it reflects the fundamental structure of competency development in PBL settings. Critical thinking may be visible in argument revision patterns within discussion logs but not in final submission grades. Collaborative capacity may be reflected in communication patterns but not in individual performance records. Multi-source integration is therefore constitutive of the framework's capacity to assess transdisciplinary competencies, not supplementary to it. Data are collected continuously throughout the project cycle, enabling longitudinal tracking rather than end-point measurement.

3.1.2 The AI Processing Layer

The AI Processing Layer uses ensemble machine learning, specifically Random Forest and Gradient Boosting, to generate predictions of learners' competency levels from the aggregated data. These methods were chosen for their proven effectiveness in educational data mining, their ability to handle high-dimensional data with non-linear interaction effects, their robustness to overfitting compared with single-model approaches, and their relative interpretability compared with deep learning architectures (Luckin et al., 2016). The prediction function can be expressed as Ŷ = f(X), where X is the multi-source feature space and Ŷ denotes the predicted competency levels. Predictions are updated at each data collection interval, so a student whose engagement changes markedly mid-project sees that change reflected in subsequent assessments rather than remaining fixed at an early evaluation.
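As a rough illustration of the ensemble idea behind Ŷ = f(X), the sketch below combines the outputs of two component models by weighted averaging. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `ensemble_predict`, the equal weights, and the two stand-in scoring functions (placeholders for trained Random Forest and Gradient Boosting models) are all hypothetical.

```python
from typing import Callable, Sequence

# A "model" here is any callable mapping a feature vector to a score.
Model = Callable[[Sequence[float]], float]


def ensemble_predict(models: Sequence[Model],
                     weights: Sequence[float],
                     features: Sequence[float]) -> float:
    """Return the weighted average of the component models' predictions."""
    if len(models) != len(weights) or not models:
        raise ValueError("need one weight per model and at least one model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise the weights so the result stays on the models' output scale.
    return sum(w / total * m(features) for m, w in zip(models, weights))


# Hypothetical stand-ins for trained Random Forest and Gradient Boosting
# models; each maps a normalised feature vector to a competency score.
def rf_stand_in(x: Sequence[float]) -> float:
    return sum(x) / len(x)


def gb_stand_in(x: Sequence[float]) -> float:
    return max(x)


learner = [0.2, 0.6, 0.4]  # illustrative normalised features (assumed values)
y_hat = ensemble_predict([rf_stand_in, gb_stand_in], [0.5, 0.5], learner)
print(round(y_hat, 3))
```

In a real deployment the stand-ins would be replaced by fitted models, and the weights would presumably be tuned on validation data; the source does not report the weights used.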
For reproducibility and interpretability, prediction follows a well-defined pipeline. First, the multi-source data are preprocessed and normalized to ensure comparability across features. Second, feature vectors are constructed and passed to the ensemble models (Random Forest and Gradient Boosting). Third, the predictions of the two models are combined by weighted averaging into a single prediction score. This prediction is then passed to the governance layer for calibration. The structured pipeline provides consistency, scalability, and traceability for the assessment process.

3.1.3 The Governance Layer

The Governance Layer is the component that most distinguishes this framework from existing AI assessment systems. In most such systems, governance is applied to outputs already produced: a fairness audit is conducted after the fact, a bias review is performed after the fact, an explainability report is generated after the fact. In the proposed framework, governance operates continuously and concurrently with prediction. Bias mitigation mechanisms are active during model training. Fairness constraints shape model behaviour as predictions are generated. Interpretability mechanisms are part of the output generation process rather than a separate post-processing step. The governance-calibrated output is formalised as Y* = g(Ŷ), where g(·) represents the governance calibration function. The gap between Ŷ and Y*, that is, between the raw prediction and the governance-calibrated output, is the measurable contribution of embedded governance.

3.2 Competency Operationalization

Each of the three target competencies is conceptualised as a composite indicator derived from interaction log data. The operationalisations were developed in two stages: an initial review of existing competency assessment tools provided the structure of the indicators, and a structured Delphi expert panel established the weighting of the indicators.
Finally, the resulting weights were validated against the field-study data before being applied in the final stages.

3.2.1 Empirical Assumption of Indicator Weights

The indicator weights were determined through a pre-defined Delphi process involving 12 experts in educational assessment (n = 4), educational data mining and learning analytics (n = 4), and project-based interdisciplinary learning pedagogy (n = 4). Experts were recruited on the basis of proven research output in the relevant domain: at least five peer-reviewed publications over the prior 10 years. The Delphi protocol followed the consensus procedure described by Hasson et al. (2000). In Round 1, open-ended expert judgments on the relative contribution of each sub-indicator to its parent competency construct were gathered. Rounds 2 and 3 presented structured feedback together with group statistics to facilitate iterative convergence. Consensus was defined as an interquartile range (IQR) of ≤ 0.5 on the 0–1 importance scale across all expert ratings.

For Critical Thinking, experts reached consensus on weights of 0.35 for InquiryDepth, 0.35 for SourceDiversity, and 0.30 for RevisionRate (IQR = 0.2, 0.3, and 0.2, respectively, at Round 3). That is, experts judged the ability to ask detailed questions and the ability to base answers on a broad array of evidence to be equally important; the ability to rethink as new information appears was weighted slightly lower but remains a substantial component.

For Collaboration, experts agreed on weights of 0.40 for TaskBalance, 0.35 for InteractionFrequency, and 0.25 for ResponseLatency⁻¹ (IQR ≤ 0.3). They regarded these as the key signals of active sharing and identified equity of contributions as the most important signal that "true collaborative functioning" is taking place.
For Creativity, weights of 0.45 for NoveltyRate, 0.30 for SolutionDiversity, and 0.25 for LexicalOriginality reached agreement (IQR ≤ 0.3), indicating that ideational novelty predominates in the measurement of educational creativity.

The resulting equations, with each indicator normalized to [0, 1] across the learner population prior to computation, are as follows:

CT = 0.35 · InquiryDepth + 0.35 · SourceDiversity + 0.30 · RevisionRate
COL = 0.40 · TaskBalance + 0.35 · InteractionFrequency + 0.25 · ResponseLatency⁻¹
CRE = 0.45 · NoveltyRate + 0.30 · SolutionDiversity + 0.25 · LexicalOriginality

3.2.2 Convergent Validity of Competency Operationalisation

Convergent validity of the three composite indicators was examined in the pilot field study by comparing framework-produced competency scores against holistic instructor ratings obtained independently using a validated rubric (Jonsson & Svingby, 2007). At the end of the 10-week project cycle, each of the five supervising instructors evaluated the 47 participants' competency development across the same three dimensions, blind to the framework's outputs. Pearson correlations indicate that framework scores were positively correlated with instructor holistic ratings: r = 0.71 (95% CI [0.54, 0.83]) for Critical Thinking, r = 0.74 (95% CI [0.58, 0.85]) for Collaboration, and r = 0.79 (95% CI [0.64, 0.88]) for Creativity. These correlations exceed a threshold often considered sufficient for convergent validity (r > 0.60; Campbell & Fiske, 1959), supporting the interpretation that the composite measures reflect the constructs they are designed to describe. Internal consistency, estimated using McDonald's omega, was acceptable for all three constructs: ω = 0.78 (CT), ω = 0.81 (COL), ω = 0.76 (CRE). These estimates are preliminary, and robust validation in multi-institutional, multi-instrument research is suggested as a focus for future studies.

3.3 Governance Mechanisms

Bias mitigation operates in two stages.
During pre-processing, a Reweighing algorithm (Kamiran & Calders, 2012) modifies the instance weights of the training set to equalise the relative influence of protected and non-protected groups during model training, addressing distributional imbalances without discarding instances or changing observed labels. During in-processing, an Adversarial Debiasing component (Zhang et al., 2018) trains the predictive model against a fairness adversary, penalising representations that encode group membership. This two-stage approach addresses bias at both the data level and the model level.

Fairness calibration operates on two concurrent criteria. Equalized odds (Hardt et al., 2016) requires that true positive rates and false positive rates be equal across groups, which is enforced here up to a tolerance:

|TPR_a − TPR_b| + |FPR_a − FPR_b| ≤ ε,

where ε = 0.05 in the current evaluation. Individual fairness (Dwork et al., 2012) requires that learners who are similar in the educational features of interest receive similar predictions.

3.3.1 Reconciliation of the Conflicting Fairness Standards

In about 12 percent of simulation evaluation cycles, equalized odds and individual fairness produce inconsistent adjustment signals: the change needed to increase group-level equity makes individual-level predictions less accurate, or vice versa. The reconciliation mechanism resolves such conflicts through a three-stage process.

First, a context classification function C(t) classifies the type of assessment at cycle t, assigning each cycle to either the summative category (a final determination of competency that influences grades or progression) or the formative category (interim feedback with no effect on the grade). This function draws on the institutional deployment configuration supplied at system initialisation (course assessment scheduling, grade contribution weights, instructor-generated feedback cycles).
Second, a conflict weight is computed as α(t) = C(t) · w_G + (1 − C(t)) · w_I, where C(t) = 1 in summative contexts and C(t) = 0 in formative contexts, w_G = 0.70 is the weight for group fairness (equalized odds), and w_I = 0.30 is the weight for individual fairness. Read against the allocations that follow, α(t) acts as the share of the adjustment assigned to group fairness, with 1 − α(t) assigned to individual fairness. In summative contexts, where the risk of systematic exclusion from high-stakes decisions is greatest, group fairness is weighted 0.70 and individual fairness 0.30. In formative feedback contexts, where the focus is on personalised learner-level support, individual fairness is weighted 0.70 and group fairness 0.30.

Third, the governance calibration function g(Ŷ) applies the weighted adjustment to yield Y*, with the magnitude of the adjustment bounded by ε = 0.05 to avoid over-correction and the introduction of new inequity. Over the 12% of cycles with conflicting criteria, this reduces conflict-induced prediction variance by an average of 34% relative to unweighted arbitration.

Interpretability results from a hybrid mechanism that employs SHAP (Lundberg & Lee, 2017) for global feature attribution and LIME (Ribeiro et al., 2016) for learner-specific explanation. SHAP values are calculated over the whole population at each assessment cycle, providing system-level transparency for institutional audit. LIME produces locally faithful linear approximations of model behaviour for individual learners, which are translated into natural-language feedback by a template-based generation module using the established Explainability Evaluation Metrics framework (Hoffman et al., 2018). The hybrid approach balances the strengths of the two methods: SHAP produces globally consistent attributions, and LIME is computationally efficient at the individual level. Together, they support both population-level accountability and learner-level actionability.
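The equalized-odds tolerance and the conflict-weighting rule of Section 3.3 can be sketched in a few lines. This is an illustrative reading rather than the authors' implementation, and it rests on several assumptions not stated in the text: competency outcomes are thresholded to binary labels so that TPR and FPR are defined, the calibration is additive (Y* = Ŷ + bounded adjustment), and the per-criterion adjustments `delta_group` and `delta_individual` are supplied by upstream fairness checks. All function and variable names are hypothetical.

```python
EPSILON = 0.05  # tolerance and adjustment bound given in Section 3.3
W_G = 0.70      # w_G from Section 3.3.1
W_I = 0.30      # w_I from Section 3.3.1


def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equalized_odds_gap(group_a, group_b):
    """|TPR_a - TPR_b| + |FPR_a - FPR_b| for two (y_true, y_pred) pairs."""
    tpr_a, fpr_a = rates(*group_a)
    tpr_b, fpr_b = rates(*group_b)
    return abs(tpr_a - tpr_b) + abs(fpr_a - fpr_b)


def conflict_weight(is_summative):
    """alpha(t) = C(t) * w_G + (1 - C(t)) * w_I, with C(t) in {0, 1}."""
    c = 1 if is_summative else 0
    return c * W_G + (1 - c) * W_I


def calibrate(y_hat, delta_group, delta_individual, is_summative):
    """Assumed additive form of g: blend the two adjustments, then bound."""
    alpha = conflict_weight(is_summative)
    adjustment = alpha * delta_group + (1 - alpha) * delta_individual
    # Bound the magnitude of the adjustment to EPSILON, as the text specifies.
    adjustment = max(-EPSILON, min(EPSILON, adjustment))
    return y_hat + adjustment


# Toy data: each group is (true labels, predicted labels).
group_a = ([1, 1, 0, 0], [1, 0, 0, 0])
group_b = ([1, 1, 0, 0], [1, 1, 1, 0])
print(equalized_odds_gap(group_a, group_b) <= EPSILON)  # False: the gap is 1.0

print(calibrate(0.60, 0.10, -0.02, is_summative=True))   # adjustment hits the bound
print(calibrate(0.60, 0.10, -0.02, is_summative=False))  # smaller, unbounded adjustment
```

With the same conflicting signals, the summative call leans toward the group-level correction and is capped at ε, while the formative call leans toward the individual-level correction and stays well inside the bound.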
4. Research Hypotheses

Four hypotheses follow from the theoretical framework. The first two concern the independent contributions of AI-based assessment and governance. The third concerns their integration. The fourth concerns a structural claim that sets this study apart from previous work: that governance does not merely improve assessment quality as an independent factor, but moderates the relationship between AI-based prediction and overall quality, strengthening prediction effects when governance is embedded.

H1: AI-driven intelligent assessment has a strong positive impact on the accuracy of assessing transdisciplinary competencies.

H2: Governance mechanisms enhance the fairness of assessment outcomes, helping to minimise systematic bias across learner populations.

H3: Integrating AI-based approaches with embedded governance mechanisms positively influences the overall quality of educational assessment, compared with either component in isolation.

H4: Governance mechanisms moderate the relationship between AI-based assessment and assessment quality, such that the positive influence of the predictive models is much stronger when governance is integrated into the system architecture.

The hypotheses are logically cumulative. H1 and H2 posit that each component adds something unique. H3 posits that together they achieve something that neither achieves in isolation. H4 treats governance as a structural condition rather than an additive improvement, which is the study's most significant empirical finding.

5. Methodology

Design Science Research (DSR) was chosen as the overarching methodological framework because the problem at the core of this study is a design problem. What is required is not a description of an existing phenomenon but the construction of an artefact, a working system, and the demonstration that it performs as intended under realistic conditions.
Traditional empirical research methods are oriented toward explaining what already exists; DSR is oriented toward building what does not yet exist and subjecting it to rigorous scrutiny.
5.1 Design Science Research Process
Problem identification. This work is motivated by two interrelated challenges. Traditional assessment tools lack the capacity to assess transdisciplinary skills in PBL settings. At the same time, AI-based assessment systems deployed without embedded governance reproduce existing educational inequalities at scale and in ways that are hard to detect or challenge. What is required is a system that can capture such complex, continually developing competencies while remaining fair and transparent.
Objective definition. The aim of the design is a methodology that increases the accuracy of assessment while ensuring that outcomes are equitable, transparent and understandable, not by applying governance to outputs already produced, but by embedding it in the processes that produce those outputs.
Design and development. The artefact is the multi-layer predictive governance framework proposed in Section 3. It brings together educational data analytics, machine learning, bias mitigation, fairness calibration and interpretability mechanisms in one operational system. Interpretability is treated not as a complementary feature but as a design constraint.
Demonstration and evaluation. The framework is demonstrated and tested against two baselines: a legacy rule-based system without AI or governance (Baseline 1, preserved for historical consideration) and a contemporary AI evaluation infrastructure without embedded governance (Baseline 2, fully described in Table 2 below). Baseline 2 is the main comparison, as it separates the contribution of governance integration from the contribution of AI capability in a wider sense.
5.2 Agent-Based Simulation Environment
The simulation evaluation was conducted within an agent-based environment reflecting the dynamic, non-linear interactions characteristic of interdisciplinary PBL. Agent-based modelling was selected because it can generate the kinds of complex, temporally extended, multi-actor data that the framework is designed to process: data that cannot be obtained from existing datasets without ethical and practical difficulties that would preclude systematic experimental variation (Bonabeau, 2002). A population of 500 synthetic learner agents was modelled, each initialised with a distinct profile of academic performance, interaction patterns, collaboration tendencies, and behavioural characteristics, drawn from distributions calibrated against real higher-education data. These profiles evolved through agent-agent and agent-environment interactions across simulation cycles, generating the non-linear developmental trajectories that the framework is specifically designed to track. Four parameter dimensions were systematically varied across experimental scenarios: academic performance levels (high, moderate, low); interaction and collaboration patterns (individual-dominant, collaborative, mixed); task complexity (low, moderate, high); and contextual stress conditions (low, moderate, high).
5.3 Pilot Field Study
The pilot field study aimed to answer the complementary question that simulation does not answer: whether these performance characteristics of the framework transfer to a real educational setting where there are real students, real instructors, and real projects. The two types of evaluation are intentionally complementary: simulation facilitates systematic variation in experimentation that would be neither ethical nor practical in an authentic setting, while the field study adds ecological validity that simulation cannot.
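The scenario design described in Section 5.2 can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: it assumes a full factorial crossing of the four dimensions (the text says only that they were "systematically varied"), and the class name LearnerAgent, the function names, and the Gaussian placeholder distributions are hypothetical stand-ins for the calibrated distributions the paper mentions.

```python
import itertools
import random
from dataclasses import dataclass

# Levels of the four parameter dimensions named in Section 5.2.
DIMENSIONS = {
    "academic_performance": ("high", "moderate", "low"),
    "collaboration_pattern": ("individual-dominant", "collaborative", "mixed"),
    "task_complexity": ("low", "moderate", "high"),
    "contextual_stress": ("low", "moderate", "high"),
}


@dataclass
class LearnerAgent:
    """Hypothetical agent profile; all attributes use a 0-1 placeholder scale."""
    agent_id: int
    academic_performance: float
    interaction_rate: float
    collaboration_tendency: float


def build_scenarios():
    """Cross every level of every dimension (assumed full factorial design)."""
    names = list(DIMENSIONS)
    return [dict(zip(names, combo))
            for combo in itertools.product(*DIMENSIONS.values())]


def init_population(n_agents=500, seed=0):
    """Draw agent profiles from placeholder distributions (not calibrated)."""
    rng = random.Random(seed)

    def clipped_gauss(mu, sigma):
        # Keep every attribute inside the 0-1 range.
        return min(1.0, max(0.0, rng.gauss(mu, sigma)))

    return [
        LearnerAgent(i, clipped_gauss(0.6, 0.15), clipped_gauss(0.5, 0.2),
                     clipped_gauss(0.5, 0.2))
        for i in range(n_agents)
    ]


scenarios = build_scenarios()
population = init_population()
print(len(scenarios), "scenarios;", len(population), "agents")
```

Under the full-factorial assumption, three levels on each of four dimensions give 3^4 = 81 scenarios; a fractional design would use a subset of the same list.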
5.3.1 Participants and Setting
This research was carried out in Semester 2 of the 2023–2024 academic year at a research-active university with established interdisciplinary PBL programmes. Forty-seven undergraduate students enrolled in three interdisciplinary PBL courses participated in the study: 24 in the experimental group (assessment supported by the framework) and 23 in the control group (assessment based on a conventional rubric). Across conditions, five faculty members served as supervising instructors. Full institutional ethics approval was obtained for this study (Protocol Reference: redacted for double-blind review). All participants provided written informed consent.
5.3.2 Sample Size Justification
The sample of 47 participants was determined by a priori power analysis using G*Power 3.1 (Faul et al., 2007). For independent-samples t-tests (α = 0.05, power = 0.80) to detect a medium effect (d = 0.50), the analysis indicated a minimum of 44 participants, which the achieved sample (n = 47) exceeds. The medium effect size (d = 0.50) was chosen as the minimum practically significant difference, consistent with the educational assessment literature and with effect sizes found in similar AI-in-education intervention studies (d = 0.40–0.70; Zawacki-Richter et al., 2019). Because the sample comes from a single institution, the findings may not generalise; this limitation is stated explicitly in Section 9.
5.3.3 Procedure
A pre-post design was used, with competency scores collected at the start and end of a 10-week project cycle. Experimental group instructors received guided feedback reports based on the framework, consisting of competency scores, SHAP-based feature attributions, and LIME-generated natural-language explanations. Instructors in the control group used validated project rubrics (Jonsson & Svingby, 2007) without any AI support. Both groups were given identical project briefs and the same duration of instructor contact.
A second assessor, blind to group assignment, independently rated a random 30% subsample of project outputs. Inter-rater reliability was acceptable for all three competency dimensions: Critical Thinking (ICC = 0.83; 95% CI [0.74, 0.89]), Collaboration (ICC = 0.79; 95% CI [0.69, 0.86]), and Creativity (ICC = 0.76; 95% CI [0.65, 0.84]).
5.4 Evaluation Metrics and Baseline Specifications
Performance was assessed using five metrics that together cover technical and ethical aspects of assessment quality. Accuracy measures the agreement between predicted and actual competency levels. Fairness measures equality in prediction errors across groups, computed as the complement of group-level disparity. Reliability captures the consistency of outputs across simulation iterations. Response time is the average processing time per assessment iteration. Explainability is measured by SHAP attribution coverage and instructor-rated interpretability on the validated 7-point Explainability Evaluation Metrics scale (Hoffman et al., 2018). Overall assessment quality is defined as the weighted composite Q = w₁·A + w₂·F + w₃·R + w₄·T, with weights w₁ = 0.30 (Accuracy), w₂ = 0.30 (Fairness), w₃ = 0.25 (Reliability/Interpretability), and w₄ = 0.15 (Response Time).
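The composite Q defined above is a plain weighted sum, which the following sketch makes concrete. The function name composite_quality and the dictionary keys are illustrative choices, and treating R as the reliability score alone (the text labels w₃ "Reliability/Interpretability") is an assumption. The inputs are the simulation means reported in Section 6.1; the resulting composites are illustrative arithmetic, not values reported by the study.

```python
# Weights as stated in Section 5.4; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "fairness": 0.30,
    "reliability": 0.25,   # assumption: R taken as the reliability score only
    "response_time": 0.15,
}


def composite_quality(scores, weights=WEIGHTS):
    """Return Q = w1*A + w2*F + w3*R + w4*T for metric scores on a 0-1 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing metric scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Illustrative inputs: simulation means reported for each condition.
baseline_2 = {"accuracy": 0.75, "fairness": 0.60,
              "reliability": 0.70, "response_time": 0.65}
proposed = {"accuracy": 0.92, "fairness": 0.88,
            "reliability": 0.90, "response_time": 0.85}

q_baseline = composite_quality(baseline_2)
q_proposed = composite_quality(proposed)
print(f"Q (Baseline 2) = {q_baseline:.4f}, Q (proposed) = {q_proposed:.4f}")
```

Because accuracy and fairness carry equal weight (0.30 each), a system cannot raise Q by trading one for the other at a one-to-one rate.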
Table 2 Technical Specifications for Baseline 2 (GPT-4 Standard, No Embedded Governance)
Parameter | Specification
Model architecture | GPT-4 (OpenAI) via API; standard configuration, no fine-tuning
Prompt structure | Zero-shot assessment prompts; competency rubric in system message; interaction log excerpts as user input
Input data | Identical multi-source feature set: academic records, interaction logs, behavioural indicators, collaborative activity data
Output format | Numerical competency score (0–1) per dimension with brief natural-language rationale; no SHAP/LIME attribution
Bias mitigation | None (no Reweighing, no Adversarial Debiasing)
Fairness constraints | None (no equalized odds enforcement; no individual fairness calibration)
Governance integration | Post-hoc human instructor review only
Training data | GPT-4 pre-training corpus (OpenAI, 2023); no domain-specific fine-tuning
Evaluation conditions | Identical: same 500-agent population, same feature set, same five metrics, same 50 replications
Note. Baseline 2 was selected as the primary comparison because it represents the current state of practice in generative AI-based educational assessment deployed without embedded governance. The deliberate absence of bias mitigation and fairness constraints in Baseline 2 isolates the governance contribution of the proposed framework.
To further isolate the contribution of embedded governance mechanisms, an additional baseline was introduced using ensemble machine learning models without governance integration. This baseline employs the same Random Forest and Gradient Boosting algorithms used in the proposed framework but excludes all governance components, including bias mitigation, fairness calibration, and explainability mechanisms. This comparison enables a more rigorous evaluation of whether performance improvements are attributable to the governance layer rather than the predictive modeling itself.
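The logic of this ablation can be made explicit by writing the comparison conditions as configuration records. The sketch below is illustrative only: the Condition class, its field names, and the differing_fields helper are hypothetical, and the flag values simply restate the baseline specifications given in the text.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Condition:
    """Hypothetical record of which components an evaluation condition includes."""
    name: str
    predictor: str
    bias_mitigation: bool        # Reweighing / Adversarial Debiasing
    fairness_constraints: bool   # equalized odds, individual fairness calibration
    shap_lime_attribution: bool  # SHAP/LIME interpretability layer


BASELINE_2 = Condition("Baseline 2", "GPT-4 zero-shot", False, False, False)
BASELINE_3 = Condition("Baseline 3", "RF + GB ensemble", False, False, False)
PROPOSED = Condition("Proposed framework", "RF + GB ensemble", True, True, True)


def differing_fields(a, b):
    """List the component fields on which two conditions differ."""
    return [f.name for f in fields(a)
            if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)]


# Baseline 3 vs. the proposed framework: only governance components differ.
print(differing_fields(BASELINE_3, PROPOSED))
# Baseline 2 vs. Baseline 3: only the predictive model differs.
print(differing_fields(BASELINE_2, BASELINE_3))
```

Read this way, the Baseline 3 comparison holds the predictor fixed and varies governance, while the Baseline 2 versus Baseline 3 comparison holds governance fixed (absent) and varies the predictor.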
Table 3 Technical Specifications for Baseline 3 (Ensemble ML without Governance)
Parameter | Specification
Model architecture | Random Forest + Gradient Boosting (ensemble)
Input data | Identical multi-source feature set
Output format | Numerical competency scores (0–1)
Bias mitigation | None
Fairness constraints | None
Governance integration | Not included
Interpretability | Not included (no SHAP/LIME)
Training | Standard supervised learning
Evaluation conditions | Identical to proposed framework
5.5 Threats to Validity
Several limitations should be considered when interpreting the results of this study. First, the pilot field study was conducted with a relatively small sample size (n = 47) from a single institutional context, which may limit the generalisability of the findings. Future research should validate the framework across multiple institutions and larger, more diverse populations. Second, the simulation environment, while designed to approximate realistic PBL dynamics, cannot fully capture the complexity of real-world educational settings. Although agent-based modelling enables controlled experimentation, it remains an abstraction of actual learner behaviour. Third, the operationalisation of transdisciplinary competencies relies on proxy indicators derived from interaction data, which may not fully represent the depth of cognitive and collaborative processes. While convergent validity was established through instructor ratings, further validation using multiple assessment instruments is recommended. Finally, the proposed governance mechanisms are implemented within a specific set of fairness criteria and thresholds. Alternative definitions of fairness or different parameter settings may lead to different outcomes, suggesting the need for context-sensitive calibration in future implementations.
6. Experimental Evaluation and Results
6.1 Primary Quantitative Results
Table 3 presents the performance comparison between the proposed framework and Baseline 2 across five evaluation metrics. Values represent means across 50 simulation replications; all between-condition differences are statistically significant at p < 0.05 or better.
Table 3 Performance Comparison: Proposed Framework vs. Baseline 2 (GPT-4 Standard, No Embedded Governance)
Metric | Baseline 2 (GPT-4) | Proposed Framework | Improvement
Accuracy | 0.75 | 0.92 | +22.7%
Fairness | 0.60 | 0.88 | +46.7%
Reliability | 0.70 | 0.90 | +28.6%
Response Time | 0.65 | 0.85 | +30.8%
Explainability | 0.55 | 0.91 | +65.5%
Note. All differences statistically significant at p < 0.05; accuracy, fairness, and explainability at p < 0.01. Values are means across 50 simulation replications. Cohen's d range: 0.71–1.12. 95% CIs are reported in the supplementary materials.
6.2 Results Analysis
The accuracy improvement, from 0.75 to 0.92 (+22.7%), indicates that the proposed model identifies complex, non-linear patterns in learner performance data that Baseline 2, operating without the discipline imposed by embedded governance, does not capture as consistently. Importantly, this improvement demonstrates that the integration of governance does not come at the expense of predictive performance, a point that becomes particularly important when paired with the fairness results. Fairness shows the second-largest gain (+46.7%), exceeded only by explainability. Baseline 2 yields significantly inequitable outcomes across learner subgroups, not because of intentional design but as an expected consequence of training on historical educational data that already reflects existing inequalities. The governance layer addresses this by intervening at the modelling stage, changing how predictions are generated, rather than filtering outputs afterwards. The 46.7% increase reflects the difference between changing the process and auditing its outputs.
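The improvement column in the performance comparison is the relative change over Baseline 2, which can be checked directly from the reported means. The sketch below is a verification aid under that reading; the function and variable names are illustrative.

```python
# Reported simulation means: (Baseline 2, proposed framework).
RESULTS = {
    "Accuracy": (0.75, 0.92),
    "Fairness": (0.60, 0.88),
    "Reliability": (0.70, 0.90),
    "Response Time": (0.65, 0.85),
    "Explainability": (0.55, 0.91),
}


def relative_improvement(baseline, proposed):
    """Percentage change of the proposed score relative to the baseline score."""
    return (proposed - baseline) / baseline * 100.0


for metric, (base, prop) in RESULTS.items():
    rel = relative_improvement(base, prop)
    print(f"{metric}: +{rel:.1f}% (absolute gain +{prop - base:.2f})")

# Metric with the largest relative improvement.
largest = max(RESULTS, key=lambda m: relative_improvement(*RESULTS[m]))
print("Largest relative gain:", largest)
```

Relative gains depend on the baseline level, so a metric that starts lower (such as explainability at 0.55) shows a larger percentage for the same absolute change.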
The fairness result supports H2. The improvements in reliability (+28.6%) and response time (+30.8%) show that governance integration neither destabilises the assessment system nor slows it down. The consistency improvements across simulation iterations show that governance mechanisms impose a constructive type of discipline on the predictive models, lowering variance in outputs. This matters in practical terms, since it is hard to trust systems to make consequential individual decisions when they produce different results under similar conditions. The explainability improvement is the largest by a substantial margin (+65.5%). The SHAP-LIME hybrid mechanism significantly increases the proportion of prediction variance attributable to identifiable features and generates natural-language explanations that instructors in the field study characterized as interpretable and actionable, as measured with the Hoffman et al. (2018) validated scale. An assessment output that cannot be explained to an educator who understands the learning context cannot be meaningfully contested, and an output that cannot be contested cannot be trusted. Taken together, these results support H3.
6.3 Moderation Analysis: Testing H4
The moderation hypothesis (H4), which treats governance as a structural moderating variable rather than an additive one, was examined by comparing the association between AI-based predictive performance and overall quality in the presence and absence of embedded governance. Without governance (Baseline 2), increases in predictive accuracy do not translate proportionally into increased assessment quality, because no constraints exist on fairness and reliability. Where governance is embedded, increases in predictive accuracy translate more consistently into improvements in overall quality.
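One common way to formalise such a comparison is a regression with an interaction term, Q = b0 + b1·A + b2·G + b3·(A·G), where G = 1 when governance is embedded. With a binary moderator, the ordinary-least-squares estimate of b3 equals the difference between the accuracy-to-quality slopes fitted separately in each condition. The sketch below illustrates that identity on toy data; the data, function names, and model form are assumptions for illustration and are not the study's analysis or results.

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def moderation_effect(accuracy, quality, governed):
    """Return (slope without governance, slope with governance, difference).

    For a binary moderator, the difference equals the interaction
    coefficient b3 in Q = b0 + b1*A + b2*G + b3*(A*G).
    """
    groups = {False: ([], []), True: ([], [])}
    for a, q, g in zip(accuracy, quality, governed):
        groups[bool(g)][0].append(a)
        groups[bool(g)][1].append(q)
    slope_without = ols_slope(*groups[False])
    slope_with = ols_slope(*groups[True])
    return slope_without, slope_with, slope_with - slope_without


# Toy data (hypothetical): quality rises faster with accuracy under governance.
levels = [0.60, 0.70, 0.80, 0.90]
accuracy = levels + levels
governed = [False] * 4 + [True] * 4
quality = [0.50 + 0.20 * a for a in levels] + [0.30 + 0.70 * a for a in levels]

slope_without, slope_with, b3 = moderation_effect(accuracy, quality, governed)
print(f"slope without governance = {slope_without:.2f}")
print(f"slope with governance    = {slope_with:.2f}")
print(f"interaction (b3)         = {b3:.2f}")
```

A positive b3 corresponds to the pattern H4 predicts: the same gain in accuracy yields a larger gain in quality when governance is present. Whether a reported coefficient is standardised depends on how the variables were scaled before fitting.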
Moderation analysis confirmed that the governance condition significantly strengthened the accuracy-to-quality relationship (β = 0.47, p < .001). Governance is not just an additive enhancement; it is a structural condition that alters the performance of the predictive component.
6.4 Longitudinal Performance
Performance was measured over five consecutive testing cycles, at Weeks 2, 4, 6, 8 and 10 of the simulated project cycle. The proposed framework shows a consistent pattern of iterative improvement: fairness increases from 0.71 (Cycle 1) to 0.89 (Cycle 5), with the greatest gains between Cycles 1 and 3, as the governance calibration mechanism accumulates enough data to refine its bias-mitigation decisions. Accuracy stabilises at 0.92 by Cycle 4. Explainability exhibits the steepest initial gain, from 0.72 to 0.88 between Cycles 1 and 2, as rapidly accumulating interaction data continuously updates the SHAP attribution model. Baseline 2 shows only a marginal increase across the cycle (mean gain per cycle = 0.008 vs. 0.036 for the proposed framework), reflecting the lack of a feedback-adaptive governance mechanism.
6.5 Pilot Field Study Results
001, d = 0.68); Fairness, measured as cross-group score disparity reduction (t(45) = 5.61, p < .001, d = 0.91); Reliability (t(45) = 3.87, p = .001, d = 0.74); and Interpretability as rated by instructors on the Hoffman et al. (2018) validated 7-point scale (t(45) = 6.14, p < .001, d = 0.89). Effect sizes (d range: 0.68–0.91) are consistent with simulation findings, suggesting that the performance characteristics observed under controlled conditions reflect properties of the architecture that transfer to authentic settings. These results are interpreted with appropriate caution given the sample size and single-institution design.
7. Discussion
The results are consistent and statistically robust across both simulation and field contexts.
This consistency matters: a system that works well only under favourable circumstances is not one that can be relied on under the genuinely variable and unpredictable conditions of authentic educational environments. The framework's gains are largest in fairness and explainability, precisely where embedded governance is expected to add value, which indicates that the gains stem from governance and not merely from improved predictive modelling. The fairness finding warrants special attention. Baseline 2, a technically sound and widely used AI methodology, produces systematically less equitable outcomes across learner subgroups. This is not a design flaw in any narrow sense; rather, it is the natural outcome of training a system on historical educational data with no means to detect and adjust for patterns of inequality contained in the data. The governance layer's improvement does not come from filtering the outputs of an unchanged model. It alters the mechanism by which predictions are created: what features the model depends on, how it handles group-correlated patterns, and how it responds when bias-mitigation and fairness signals conflict. The 46.7% improvement in fairness represents a quantifiable difference between changing the process and auditing the output. These findings also have implications for the relationship between fairness and accuracy. It is widely believed, partly on the basis of theoretical impossibility results in the algorithmic fairness literature (Chouldechova, 2017; Kleinberg et al., 2016), that improving fairness requires sacrificing some accuracy. These results put pressure on that assumption, or at least qualify it. The impossibility results are most often invoked for post-hoc fairness interventions on fixed models; they do not necessarily apply in the same way when governance is built into the training process itself.
When Reweighing and Adversarial Debiasing are active during training, the model learns representations that are, at least in part, less sensitive to group-correlated spurious features, thus reducing bias while optimizing predictive accuracy on the features that are educationally relevant. The fairness-accuracy tradeoff may thus be, at least in part, an artefact of the way AI systems are typically designed rather than an inherent property of the prediction problem itself. The explainability results have both practical and theoretical significance. The SHAP-LIME hybrid provides explanations that instructors rated as meaningful and usable rather than merely accepted at face value. An instructor who knows why a student received a given competency assessment can judge whether that reasoning matches their own understanding of the student's learning, challenge it if it does not, and use it to inform instructional choices if it does. Such human-AI interaction positions the AI system as a collaborator in assessment rather than an authority, in contrast to a black-box output, which can only be accepted, rejected or ignored. The response time and reliability results address the practical concern that governance integration will impose computational overhead and introduce variability. Neither concern is supported. The governance mechanisms, which are integrated with prediction rather than applied sequentially afterward, did not introduce the processing overhead that a sequentially layered architecture would cause. And the discipline imposed by constant fairness constraints seems to reduce rather than increase output variability, an intuitive finding given that bias in predictions is itself a source of inconsistency. A methodological comment on the use of agent-based simulation is warranted.
The choice was intentional: simulation allows systematic variation across experimental conditions in a way that would be ethically and practically impossible in real-world educational contexts at this stage of the framework's development. The 500-agent population, calibrated against real distributions of higher-education data, provides the kind of complex, temporally extended, non-linear data the framework is designed to handle. The consistency between simulation and field-study effect sizes, itself an empirically meaningful result, suggests that the simulation captures something authentic about the framework architecture rather than merely reproducing its own calibration parameters.
8. Implications
Practical implications. Institutions that want to implement the framework can follow four stages. In Stage 1 (Infrastructure Assessment, 1 to 2 months), institutions audit their digital learning environment to determine whether it generates the four data streams required by the framework's Data Layer: academic records, interaction logs, behavioural indicators, and collaborative activity data. Institutions that have interaction logs enabled on LMS applications (Canvas, Moodle, Blackboard) will usually have sufficient infrastructure; those operating predominantly in face-to-face or low-tech spaces need to address this limitation before they can continue. In Stage 2 (Governance Parameter Configuration, about 1 month), assessment coordinators set up the context classification function by specifying the assessment calendar, the grade contribution weights, and the formative/summative designation of each assessment event. In Stage 3 (Instructor Orientation, 2–3 classes), instructors receive guided training in reading framework output reports, specifically in interpreting SHAP attribution charts and LIME-based natural-language explanations in relation to the competency constructs being evaluated. In practice, the interpretability improvements reported in Section 6.5 are realised only when instructors are prepared to apply these outputs to their instructional decisions. In Stage 4 (Monitored Deployment, one full project cycle), the framework runs in parallel with existing assessment modes, and its outcomes are reviewed by instructors against human judgement prior to full incorporation. Full deployment without human intervention is not recommended at this stage of the framework's development.
Policy implications. The gap between Baseline 2 (fairness 0.60) and the proposed framework (fairness 0.88) is not a theoretical risk; it is a measurable consequence of using AI assessment without embedded governance, documented here in a controlled environment. Policymakers establishing standards for AI in education that reflect the international frameworks of UNESCO (2023) and the OECD (2021) should recognise embedded governance as a required precondition for responsible deployment rather than an addendum. External regulation applied only after deployment is inadequate on its own: it may discover problems that have already produced unfair results, but it cannot prevent them from happening.
Educational implications. By facilitating a shift from outcome-based to process-based assessment, the framework has direct implications for how transdisciplinary competencies are conceptualized and assessed. Competencies that develop non-linearly and in context cannot be adequately captured by static knowledge tests; the framework offers a workable operational alternative.
Institutions will have to invest in fostering educators' ability to interpret and critically engage with AI-generated assessment insights, a professional development demand that the framework's interpretability gains make far more tractable than opaque AI systems do.
Theoretical implications. A main theoretical contribution of this study is its reframing of the relationship between AI prediction and governance. Governance is not a constraint on what AI assessment systems can do; if properly embedded, it is a condition of what they do. This redirects the design question from how to trade off accuracy against ethical design to how to design systems in which both properties interact and reinforce each other. The contention that embedded governance enhances fairness and accuracy at the same time, and that the trade-off normally presumed is an artefact of governance-as-afterthought, is the most significant theoretical result of the study.
9. Conclusion
This research addresses a documented but underexplored gap in the literature on AI-driven educational assessment: no existing framework has combined AI-based prediction, embedded governance, and the particular challenges of assessing transdisciplinary competence in PBL contexts in a unified design. The proposed predictive governance model shows that this integration is technically possible and that it yields tangible performance gains in all five domains (accuracy, fairness, reliability, efficiency, and explainability) that determine whether an assessment system can be trusted in complex learning environments. The most important finding of the research concerns the relationship between fairness and accuracy.
Contrary to the common premise, drawn from the theoretical algorithmic fairness literature, that these properties trade off against each other, the evidence here suggests that they reinforce each other when governance is embedded in the training process rather than applied to outputs afterwards. This implies that the commonly observed fairness-accuracy tension may be partly an artefact of how AI assessment systems are typically designed rather than an intrinsic property of the prediction problem. If correct, this has real implications for how the field understands the design space of responsible AI assessment. These conclusions are qualified by several limitations. Because the evaluation relies on agent-based simulation, and the pilot field study provides only preliminary ecological validity, the results need to be examined at larger scale across different institutions and educational settings before strong generalisations can be made. The pilot sample (n = 47, single institution, 10-week period) is too small to support generalisation across institutional cultures, disciplinary contexts or student populations, and the findings are likely to hold only for a limited range of settings. The framework is also based on data from digital interaction logs, which may not adequately reflect learning that occurs in face-to-face or low-technology project contexts, a limitation that institutions should explicitly evaluate prior to implementation.
Four avenues for future research follow: multi-institutional validation studies using larger and more representative samples; expanding the input variable set with learning evidence gathered from non-digital contexts; measuring the moderating effect of institutional culture and disciplinary norms on the success of governance-integrated assessment systems; and assessing the extent to which the performance benefits of the reconciliation mechanism are replicable across institutional deployment conditions that differ significantly from those used in this study. The last is key: context classification is trained on the institutional settings specified at initialisation, and its generalisation to markedly different institutional settings is still an open empirical question. What this study ultimately shows is that the decision confronting educational institutions is not a choice between AI capability and ethical responsibility, between systems that are accurate and systems that are fair and explainable. It is a design challenge, and one for which this framework provides a deliberately constructed, testable answer.
Declarations
Author Contribution
A.H. conceptualized the study, developed the methodology, designed the model, and wrote the original draft of the manuscript. N.S. conducted data analysis, contributed to validation, and participated in writing, reviewing, and editing the manuscript. All authors reviewed and approved the final version of the manuscript.
Data Availability Statement
Both the simulation datasets and the agent-based modelling code produced in this work are available from the corresponding author upon reasonable request and will be deposited in an open-access repository upon acceptance.
Data from the pilot field study are not publicly available because of institutional ethics approval conditions and participant privacy protections; de-identified summary statistics may be made available to qualified researchers through a formal data sharing agreement.
Ethical approval and consent to participate: The pilot field study received full institutional ethics approval (Protocol Reference: redacted for double-blind review). All participants provided written informed consent and were explicitly informed of their right to withdraw at any time without academic consequence.
Use of AI writing tools: AI-assisted language tools were used for grammar checking and reference formatting in preparing this manuscript. All intellectual content, analytical decisions, theoretical arguments, and empirical interpretations are the authors' own. The authors take full responsibility for the integrity and accuracy of all reported work.
Competing interests: The authors declare no competing interests, financial or otherwise, related to the work reported in this manuscript.
Funding: [Funding information redacted for double-blind review]
References
Baker, R., & Hawn, A. (2022). Algorithmic bias in education. Educational Researcher, 51(4), 259–268. https://doi.org/10.3102/0013189X211059733
Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and machine learning: Limitations and opportunities. MIT Press. https://fairmlbook.org
Baskerville, R., Baiyere, A., Gregor, S., Hevner, A., & Rossi, M. (2018). Design science research contributions: Finding a balance between artifact and theory. Journal of the Association for Information Systems, 19(5), 358–376. https://doi.org/10.17705/1jais.00495
Bonabeau, E. (2002). Agent-based modeling: Methods and techniques for simulating human systems. Proceedings of the National Academy of Sciences, 99(Suppl. 3), 7280–7287. https://doi.org/10.1073/pnas.082080899
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. https://doi.org/10.1037/h0046016
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153–163. https://doi.org/10.1089/big.2016.0047
Dwivedi, Y. K., et al. (2023). So what if ChatGPT wrote it? Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management, 71, 102642. https://doi.org/10.1016/j.ijinfomgt.2023.102642
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (pp. 214–226). ACM. https://doi.org/10.1145/2090236.2090255
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. https://doi.org/10.3758/BF03193146
Floridi, L., et al. (2018). AI4People — An ethical framework for a good AI society. Minds and Machines, 28(4), 689–707. https://doi.org/10.1007/s11023-018-9482-5
Gašević, D., Dawson, S., & Siemens, G. (2015). Let's not forget: Learning analytics are about learning. TechTrends, 59(1), 64–71. https://doi.org/10.1007/s11528-014-0822-x
Gregor, S., & Hevner, A. R. (2013). Positioning and presenting design science research for maximum impact. MIS Quarterly, 37(2), 337–355. https://doi.org/10.25300/MISQ/2013/37.2.01
Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 3315–3323.
Hasson, F., Keeney, S., & McKenna, H. (2000). Research guidelines for the Delphi survey technique. Journal of Advanced Nursing, 32(4), 1008–1015. https://doi.org/10.1046/j.1365-2648.2000.t01-1-01567.x
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly, 28(1), 75–105. https://doi.org/10.2307/25148625
Hoffman, R. R., Mueller, S. T., Klein, G., & Litman, J. (2018). Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608.
Holmes, W., Bialik, M., & Fadel, C. (2019). Artificial intelligence in education: Promises and implications for teaching and learning. Center for Curriculum Redesign.
Ifenthaler, D., & Yau, J. Y.-K. (2020). Utilising learning analytics to support study success in higher education: A systematic review. Educational Technology Research and Development, 68(4), 1961–1990. https://doi.org/10.1007/s11423-020-09788-z
Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
Kamiran, F., & Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1), 1–33. https://doi.org/10.1007/s10115-011-0463-8
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. Proceedings of Innovations in Theoretical Computer Science. https://doi.org/10.4230/LIPIcs.ITCS.2017.43
Kokotsaki, D., Menzies, V., & Wiggins, A. (2016). Project-based learning: A review of the literature. Improving Schools, 19(3), 267–277. https://doi.org/10.1177/1365480216659733
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Siemens, G. (2021). What is learning analytics? An integrative definition and conceptual framework. Journal of Learning Analytics, 8(1), 1–5. https://doi.org/10.18608/jla.2021.7140
Luckin, R., Holmes, W., Griffiths, M., & Forcier, L. B. (2016). Intelligence unleashed: An argument for AI in education. Pearson.
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 4765–4774.
Mitchell, S., Potash, E., Barocas, S., D'Amour, A., & Lum, K. (2021). Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application, 8, 141–163. https://doi.org/10.1146/annurev-statistics-042720-125902
Organisation for Economic Co-operation and Development. (2021). Recommendation of the Council on Artificial Intelligence. OECD Publishing. https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A design science research methodology for information systems research. Journal of Management Information Systems, 24(3), 45–77. https://doi.org/10.2753/MIS0742-1222240302
Perignat, E., & Katz-Buonincontro, J. (2019). STEAM in practice and research: An integrative literature review. Thinking Skills and Creativity, 31, 31–43. https://doi.org/10.1016/j.tsc.2018.10.002
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/D19-1410
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). 'Why should I trust you?': Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). https://doi.org/10.1145/2939672.2939778
UNESCO. (2023). Guidance for generative AI in education and research. UNESCO Publishing. https://doi.org/10.54675/PCSP7350
van Gelder, T. (2005). Teaching critical thinking: Some lessons from cognitive science. College Teaching, 53(1), 41–48. https://doi.org/10.3200/CTCH.53.1.41-48
Viberg, O., Hatakka, M., Bälter, O., & Mavroudi, A. (2018). The current landscape of learning analytics in higher education. Computers in Human Behavior, 89, 98–110.
https://doi.org/10.1016/j.chb.2018.07.027
Voogt, J., & Roblin, N. P. (2012). A comparative analysis of international frameworks for 21st century competences. Journal of Curriculum Studies, 44(3), 299–321. https://doi.org/10.1080/00220272.2012.668938
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge University Press.
World Economic Forum. (2024). The future of jobs report 2024. World Economic Forum. https://www.weforum.org/publications/the-future-of-jobs-report-2024/
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education — where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39. https://doi.org/10.1186/s41239-019-0171-0
Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating unwanted biases with adversarial learning. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (pp. 335–340). https://doi.org/10.1145/3278721.3278779
Additional Declarations: No competing interests reported.
{"props":{"pageProps":{"initialData":{"identity":"rs-9401633","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":622238920,"identity":"3daede2f-4dca-4f37-831a-d6e36bd55fc2","order_by":0,"name":"Ayman Abu hajaj","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA5klEQVRIiWNgGAWjYNCCCjY5xvYGIMPAglgtZ/iMmXsOgLRIEKmDsU0usX1GAohJhBb+GdmJnwvYzBh7Zz6/uuFHgQQDf3t3Al4tEjdyN0vP4EljlpydU3azB+gwiTNnN+C35kbuBmkeiWNshrNz0m7wALUYSOTi1yIPtOU3j8F/HvubZ9Ju/iFGi8GN3G3SPAlsEowz2I/dJsoWwzNvt1nzHGAzYOzJYbstYyDBQ9AvcsdzN9/m/cdW39h+/NnNN39s5Pjbewl4XyABxuIxAJP4lYMA/wEYi/0BYdWjYBSMglEwIgEA7JpI99bCUQIAAAAASUVORK5CYII=","orcid":"","institution":"Ono Academic College","correspondingAuthor":true,"prefix":"","firstName":"Ayman","middleName":"Abu","lastName":"hajaj","suffix":""},{"id":622238921,"identity":"5dafa3e4-6af8-4d3d-9dcc-03cb931ea452","order_by":1,"name":"Nadine Sheikh","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Nadine","middleName":"","lastName":"Sheikh","suffix":""}],"badges":[],"createdAt":"2026-04-13 09:23:36","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9401633/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9401633/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":107052831,"identity":"bf9271b4-fec2-452e-ac40-5e39e9a67771","added_by":"auto","created_at":"2026-04-16 08:43:12","extension":"png","order_by":1,"title":"Figure
1","display":"","copyAsset":false,"role":"figure","size":52570,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003eThree-Layer Predictive Governance Framework Architecture with Bidirectional Information Flow\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eNote. Bidirectional arrows (⇄) indicate that information flows in both directions. Governance-calibrated outputs (Y*) feed back into the AI Processing Layer to recalibrate prediction parameters; the AI Processing Layer simultaneously signals data collection priorities back to the Data Layer.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-9401633/v1/54253f74cfc4ca50aaf0207d.png"},{"id":107399037,"identity":"320792c8-9a26-4cd9-91cd-5bfe42520e87","added_by":"auto","created_at":"2026-04-21 07:12:43","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":576591,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9401633/v1/5ad82c8d-3b1d-4c10-9bee-0a2c101c32de.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"AI-Driven Intelligent Assessment System for Evaluating Transdisciplinary Competencies in Project-Based Learning: An Empirical and Simulation Study","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eAssessment is one of higher education\u0026rsquo;s most significant and least reformed practices. Universities have become far better at enabling students to develop complex, cross-disciplinary skills through project-based learning, collaborative inquiry and authentic problem-solving tasks, but they have not become correspondingly better at measuring whether that development has taken place.
The instruments on which we rely for such data were crafted for a different purpose: they measure declarative knowledge efficiently and consistently, which is precisely what educational systems required when knowledge transmission was their main aim. That aim has shifted significantly, but the instruments have not (Voogt \u0026amp; Roblin, 2012; World Economic Forum, 2024).\u003c/p\u003e\n\u003cp\u003eThis mismatch is not trivial in practice. Students who are strong on the dimensions that matter most (synthesising ideas from multiple fields, getting things done under intense pressure, generating original ideas) may fare poorly in assessments designed to check whether they have memorised the right material. The converse also holds: students trained to perform in examinations may appear more competent than their actual competency development justifies (Zawacki-Richter et al., 2019; Kovanović et al., 2021). Neither pattern is remedied by applying established tools more aggressively; both result from using instruments that are not fit for the task.\u003c/p\u003e\n\u003cp\u003eArtificial intelligence has attracted sustained interest as a potential way to address this challenge. AI-based systems can incorporate evidence from multiple sources, such as platform interaction logs, patterns of collaborative contribution, submissions, and communication logs, and can update their assessments as new evidence accumulates (Luckin et al., 2016; Ifenthaler \u0026amp; Yau, 2020). Where a traditional assessment takes a snapshot, an AI system can, in principle, follow a trajectory. This matters in PBL contexts, where the most significant learning develops progressively and across a broad range of activities rather than at a single assessment moment (Viberg et al., 2018).
\u003c/p\u003e\n\u003cp\u003eBut the risks are real and well documented. AI systems are trained on historical data, and historical education data mirrors existing inequalities: students from well-resourced backgrounds tend to outperform their counterparts from disadvantaged backgrounds, owing to differences in preparation, access to social capital, test familiarity, and the kinds of cultural knowledge that standardized assessments reward. An AI trained on such data, without mechanisms to detect or correct these patterns, will reproduce them systematically, at scale and automatically, while appearing perfectly objective (Baker \u0026amp; Hawn, 2022; Dwivedi et al., 2023). That semblance of objectivity is arguably the most dangerous feature of biased AI, because it makes bias harder to spot and harder to challenge. \u003c/p\u003e\n\u003cp\u003eThis danger has prompted both a research literature and a policy dialogue on AI governance, an approach that seeks to ensure intelligent systems operate under principles of fairness, transparency and accountability (Floridi et al., 2018; OECD, 2021; UNESCO, 2023). In educational assessment in particular, governance is not an optional enhancement; it is the condition of legitimate deployment. The literature has called for embedding governance into AI systems rather than imposing it on their final outputs (Dwivedi et al., 2023; Holmes et al., 2019). The argument is compelling, yet implementation remains rudimentary. \u003c/p\u003e\n\u003cp\u003eA systematic review of 47 AI governance frameworks published between 2016 and 2024 shows a recurring pattern: governance and prediction are treated as separate concerns, with governance applied to the outputs of AI systems rather than integrated with the processes that produce those outputs.
Systems that handle governance well often have technical limitations; systems that predict well tend to be ethically underdeveloped. No existing framework offers an operational case in which AI-based assessment and embedded governance function as a single integrated architecture. This gap is the key challenge that drives this study.\u003c/p\u003e\n\u003cp\u003eThe study addresses one specific research question: How can an AI-driven governance framework be designed to assess transdisciplinary competencies in project-based interdisciplinary learning environments while maintaining a meaningful balance between predictive accuracy and educational fairness? It makes three contributions. First, the paper lays out a functioning framework that joins AI-based prediction and governance within one operating architecture, and demonstrates that the integration is technically feasible. Second, it offers a way to analyse complex, evolving competencies that adapts continuously to the dynamics of PBL environments. Third, it provides empirical evidence, from both agent-based simulation and a pilot field study, of what a governance-integrated AI assessment system can accomplish, and under what conditions its performance advantages are most pronounced.\u003c/p\u003e"},{"header":"2. Theoretical Framework","content":"\u003cdiv id=\"Sec2\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Transdisciplinary Competencies in Contemporary Education\u003c/h2\u003e \u003cp\u003eOver the last two decades, a consensus has grown across international education policy frameworks, employer surveys, and curriculum research: students need capabilities that reach beyond their own disciplines if they are to succeed in work and life in the 21st century.
Critical thinking, collaboration, problem-solving, creativity, and the ability to integrate knowledge from different fields are now central to nearly all forms of higher education. They are treated as a core expectation rather than an aspiration or an afterthought. This trend reflects real change in the kinds of problems graduates face and the kinds of workplaces they enter. Interdisciplinary, project-based learning has emerged as one of the most prominent pedagogical approaches for developing these competencies as students engage constructively with the world around them. PBL places students in situations that demand sustained engagement with intricate, real-world problems that cannot be addressed with the knowledge of any single discipline (Kokotsaki et al., 2016; Perignat \u0026amp; Katz-Buonincontro, 2019). The evidence that well-designed PBL environments foster transdisciplinary competency development is fairly robust. The evidence that our assessment tools can reliably measure this development is far weaker. Competencies of this kind grow non-linearly, contextually, and over time, characteristics that defy the design assumptions of traditional measurement instruments (Zawacki-Richter et al., 2019; Kovanović et al., 2021).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.2 AI-Powered Assessment: Limitations and Advantages\u003c/h2\u003e \u003cp\u003eIn principle, learning analytics and machine learning have significantly broadened what assessment systems are capable of.
By combining data from digital platform interaction streams, collaborative activity logs, submission histories, and communication patterns, AI-based systems can construct a fuller picture of learner development over multiple time frames than any single instructor can provide across a whole cohort (Luckin et al., 2016; Gašević et al., 2015). For PBL environments, where substantial learning activity occurs outside formal submissions, such multi-source integration is especially useful (Viberg et al., 2018). But the limitations are well established. Many successful AI models operate as black boxes: they generate predictions but conceal the basis for them (Baker \u0026amp; Hawn, 2022). When such predictions decide whether students move forward or are held back, this opacity is not a technical nuisance; it is an accountability failure. And because AI systems are trained on historical educational data that already contain demographic and socioeconomic disparities, they risk systematically disadvantaging the very groups that traditional assessment penalizes, without anyone in the system knowing that this is happening.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.3 AI Governance in Education\u003c/h2\u003e \u003cp\u003eThe term 'AI governance' refers to the principles, structures and operational tools that hold intelligent systems to standards of fairness, transparency, accountability and privacy (Floridi et al., 2018; OECD, 2021). In educational assessment, governance is functionally compulsory, not an add-on. The literature has started to move away from top-down, externally imposed governance toward governance-by-design in the development and operation of AI systems: fairness, transparency, and accountability should be built in from the design stage.
The practical difference is substantial: a governance-by-design system identifies bias in real time and corrects it as predictions are made, while a governance-by-audit system can only flag problems that the AI system has already produced in its outputs, after students have been affected.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Fairness in Algorithms: Theoretical Foundations\u003c/h2\u003e \u003cp\u003eThe fairness obligations in the governance layer are grounded in foundational scholarship on algorithmic fairness. Barocas, Hardt, and Narayanan (2019) introduced the canonical taxonomy of fairness criteria (anti-classification, calibration, and error-rate parity) and showed that these criteria cannot, in general, all be satisfied at once. Chouldechova (2017) illustrated this incompatibility empirically, finding that equalized false-positive rates and calibration cannot both hold when group base rates differ. Dwork et al. (2012) proposed the individual fairness criterion, that similar individuals should receive similar predictions, which supplements group-level measures by attending to within-group variation that aggregate measures can obscure. The framework operationalises two complementary criteria: equalized odds (Hardt et al., 2016) at the group level and individual fairness (Dwork et al., 2012) at the learner level. Because these criteria can produce conflicting adjustment signals, a reconciliation mechanism adapts the relative weight of the two adjustments to the assessment context: group fairness is weighted more heavily in high-stakes summative contexts, and individual fairness is weighted more heavily in formative feedback contexts.
The full reconciliation procedure is described in Section \u003cspan refid=\"Sec17\" class=\"InternalRef\"\u003e3.3.1\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Existing Frameworks: A Comparative Perspective\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e provides a systematic comparison of fourteen representative types of frameworks chosen from a review of 47 governance frameworks. The frameworks were selected to represent the range of implementation modes across the four dimensions most pertinent to the present study: whether governance is integrated or post hoc; whether the framework addresses a PBL context in depth; whether it offers a practical implementation; and whether governance addresses both group-level and individual fairness.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cem\u003eComparative Analysis of Representative Existing Frameworks\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\"
colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003e Framework / Study\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDomain Focus\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGovernance Type\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePBL Context\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eOperational?\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eFairness Level\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eInterpretability\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHolmes et al. (2019)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAI in Education\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePost-hoc / principled\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup only\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLow\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eZawacki-Richter et al. 
(2019)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHigher ed AI review\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePost-hoc / principled\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eNot addressed\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLow\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFloridi et al. (2018)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGeneral AI ethics\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePost-hoc / principled\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup only\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLow\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOECD (2021)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePolicy governance\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eExternal regulation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup only\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLow\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUNESCO (2023)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEducation AI governance\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePost-hoc / principled\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup only\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMedium\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLuckin et al. (2016)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIntelligent tutoring\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNot addressed\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePartial\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePartial\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eNot addressed\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMedium\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eIfenthaler \u0026amp; Yau (2020)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLearning analytics\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNot addressed\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePartial\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eNot addressed\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMedium\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBaker \u0026amp; Hawn (2022)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBias in ed. AI\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePost-hoc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup only\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLow\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBarocas et al. (2019)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAlgorithmic fairness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEmbedded (theory)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup\u0026thinsp;+\u0026thinsp;Individual\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHardt et al. 
(2016)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFairness (ML)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEmbedded (theory)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup only\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDwork et al. (2012)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIndividual fairness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEmbedded (theory)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eIndividual only\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eN/A\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKamiran \u0026amp; Calders (2012)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBias mitigation (ML)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEmbedded (technical)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eYes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup only\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLow\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eProposed Framework\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003ePBL competency assessment\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003eEmbedded (operational)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003eYes\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003eYes\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003eGroup\u0026thinsp;+\u0026thinsp;Individual\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003eHigh\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eNote. PBL\u0026thinsp;=\u0026thinsp;Project-Based Learning. 'Operational' indicates whether the framework provides a working implementation rather than principles or recommendations only. 'Fairness Level' indicates whether individual-level, group-level, or both fairness criteria are addressed.\u003c/em\u003e \u003c/p\u003e \u003cp\u003eThe comparison reveals three patterns. First, governance in existing frameworks is predominantly post-hoc and principle-based: frameworks define standards for AI systems to \u0026lsquo;meet\u0026rsquo; rather than embedding mechanisms that help ensure those standards are met during operation. Holmes et al. (2019) and Zawacki-Richter et al. 
(2019), the most cited works at the intersection of AI and education, both exemplify this pattern. Second, no existing framework provides a governance implementation that specifically addresses PBL environments. Third, the technical literature on algorithmic fairness (Barocas et al., 2019; Hardt et al., 2016; Dwork et al., 2012) offers a compelling theoretical foundation that has not been explored in educational contexts generally or in PBL specifically. The proposed framework addresses these gaps.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.6 Research Gap\u003c/h2\u003e \u003cp\u003eThe body of research summarised in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e is divided into silos that do not communicate with one another. Technically robust AI models are ethically underdeveloped. Ethical guidelines lack operational detail. Educationally contextualised work does not address the particular task of assessing transdisciplinary competencies in PBL contexts. The proposed framework occupies the space where all three lines of work converge: it is technically rigorous, ethically grounded, and educationally embedded.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. The Proposed Predictive Governance Framework","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n \u003ch2\u003e3.1 Architecture Overview\u003c/h2\u003e\n \u003cp\u003eThe framework consists of three functionally distinct but operationally interdependent layers. Critically, information does not flow in one direction only. Governance-calibrated outputs feed back into the predictive models, and data collection priorities adapt to what the predictive and governance layers identify as most informative. This bidirectional flow, illustrated in Fig. 
1, is central to the framework\u0026apos;s adaptive capacity and distinguishes it from sequential pipeline designs.\u003c/p\u003e\n \u003cdiv id=\"Sec10\" class=\"Section3\"\u003e\n \u003ch2\u003e3.1.1 The Data Layer\u003c/h2\u003e\n \u003cp\u003eThe Data Layer aggregates evidence from four source categories: academic performance records (grades, submission histories, prior trajectories), digital platform interaction logs (time-on-task, resource access, revision patterns), behavioural indicators (engagement frequency, responsiveness to feedback), and collaborative activity data (task distribution, contribution balance, peer communication). The rationale for multi-source integration is not merely comprehensiveness; it reflects the fundamental structure of competency development in PBL settings. Critical thinking may be visible in argument revision patterns within discussion logs but not in final submission grades. Collaborative capacity may be reflected in communication patterns but not in individual performance records. Multi-source integration is therefore constitutive of the framework\u0026apos;s capacity to assess transdisciplinary competencies, not supplementary to it. Data are collected continuously throughout the project cycle, enabling longitudinal tracking rather than end-point measurement.\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec11\" class=\"Section3\"\u003e\n \u003ch2\u003e3.1.2 The AI Processing Layer\u003c/h2\u003e\n \u003cp\u003eThe AI Processing Layer uses ensemble machine learning, specifically Random Forest and Gradient Boosting, to generate predictions of learners\u0026apos; competency levels from the aggregated data. 
These methods were chosen for their proven effectiveness in educational data mining, their ability to handle high-dimensional data with non-linear interaction effects, their robustness to overfitting compared with single-model approaches, and their relative interpretability compared with deep learning architectures (Luckin et al., 2016). The prediction function can be expressed as Ŷ = f(X), where X is the multi-source feature space and Ŷ is the vector of predicted competency levels. Predictions are updated at each data collection interval, so a student whose engagement changes markedly mid-project sees that change reflected in subsequent assessments rather than remaining locked into an early evaluation. For reproducibility and interpretability, prediction follows a well-defined pipeline. First, the multi-source data are preprocessed and normalised to ensure comparability across features. Second, feature vectors are constructed and passed to the ensemble models (Random Forest and Gradient Boosting). Third, the predictions of the two models are combined by weighted averaging into a single prediction score. This score is then passed to the Governance Layer for calibration. The structured pipeline supports consistency, scalability, and traceability in the assessment process.\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec12\" class=\"Section3\"\u003e\n \u003ch2\u003e3.1.3 The Governance Layer\u003c/h2\u003e\n \u003cp\u003eThe Governance Layer is the component that most distinguishes this framework from existing AI assessment systems. In most such systems, governance is applied to outputs already produced: a fairness audit is conducted after the fact, a bias review is performed after the fact, an explainability report is generated after the fact. In the proposed framework, governance operates continuously and concurrently with prediction. Bias mitigation mechanisms are active during model training. 
Fairness constraints shape model behaviour as predictions are generated. Interpretability mechanisms are part of the output generation process rather than a separate post-processing step. The governance-calibrated output is formalised as Y* = g(Ŷ), where g(\u0026middot;) represents the governance calibration function. The gap between the raw prediction Ŷ and the governance-calibrated output Y* is the measurable contribution of embedded governance.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003e3.2 Competency Operationalisation\u003c/h2\u003e\n \u003cp\u003eEach of the three target competencies is operationalised as a composite indicator derived from interaction log data. The operationalisations were developed in two steps: a review of existing competency assessment tools provided the structure of the indicators, and a structured Delphi expert panel established the indicator weights. The resulting weights were then validated against the field-study data before their final application.\u003c/p\u003e\n \u003cdiv id=\"Sec14\" class=\"Section3\"\u003e\n \u003ch2\u003e3.2.1 Empirical Derivation of Indicator Weights\u003c/h2\u003e\n \u003cp\u003eThe indicator weights were determined through a pre-defined Delphi process involving 12 experts in educational assessment (n\u0026thinsp;=\u0026thinsp;4), educational data mining and learning analytics (n\u0026thinsp;=\u0026thinsp;4), and project-based interdisciplinary learning pedagogy (n\u0026thinsp;=\u0026thinsp;4). Experts were recruited on the basis of demonstrated research output in the relevant domain, defined as at least five peer-reviewed publications in the preceding 10 years. The Delphi procedure followed the consensus protocol described by Hasson et al. (2000). 
In Round 1, open-ended expert judgments on the relative contribution of each sub-indicator to its parent competency construct were gathered. Rounds 2 and 3 presented structured feedback together with group statistics to facilitate iterative convergence. Consensus was defined as an interquartile range (IQR) of \u0026le;\u0026thinsp;0.5 on the 0\u0026ndash;1 importance scale across all expert ratings. For Critical Thinking, experts converged on weights of 0.35 for InquiryDepth, 0.35 for SourceDiversity, and 0.30 for RevisionRate (IQR\u0026thinsp;=\u0026thinsp;0.2, 0.3, and 0.2 respectively at Round 3). That is, experts judged the ability to ask detailed questions and the ability to draw on a broad range of evidence to be equally important, with the ability to revise a position as new information appears weighted only slightly lower. For Collaboration, experts converged on weights of 0.40 for TaskBalance, 0.35 for InteractionFrequency, and 0.25 for ResponseLatency⁻\u0026sup1; (IQR\u0026thinsp;\u0026le;\u0026thinsp;0.3), identifying equity of contributions as the most important signal that \u0026ldquo;true collaborative functioning\u0026rdquo; is taking place. For Creativity, experts converged on weights of 0.45 for NoveltyRate, 0.30 for SolutionDiversity, and 0.25 for LexicalOriginality (IQR\u0026thinsp;\u0026le;\u0026thinsp;0.3), indicating that ideational novelty predominates in the measurement of creativity in educational settings. 
The resulting equations, with each indicator normalised to [0, 1] across the learner population prior to computation, are as follows:\u003c/p\u003e\n \u003cp\u003eCT\u0026thinsp;=\u0026thinsp;0.35 \u0026middot; InquiryDepth\u0026thinsp;+\u0026thinsp;0.35 \u0026middot; SourceDiversity\u0026thinsp;+\u0026thinsp;0.30 \u0026middot; RevisionRate\u003c/p\u003e\n \u003cp\u003eCOL\u0026thinsp;=\u0026thinsp;0.40 \u0026middot; TaskBalance\u0026thinsp;+\u0026thinsp;0.35 \u0026middot; InteractionFrequency\u0026thinsp;+\u0026thinsp;0.25 \u0026middot; ResponseLatency⁻\u0026sup1;\u003c/p\u003e\n \u003cp\u003eCRE\u0026thinsp;=\u0026thinsp;0.45 \u0026middot; NoveltyRate\u0026thinsp;+\u0026thinsp;0.30 \u0026middot; SolutionDiversity\u0026thinsp;+\u0026thinsp;0.25 \u0026middot; LexicalOriginality\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec15\" class=\"Section3\"\u003e\n \u003ch2\u003e3.2.2 Convergent Validity of Competency Operationalisation\u003c/h2\u003e\n \u003cp\u003eConvergent validity of the three composite indicators was examined in the pilot field study by comparing framework-produced competency scores against holistic instructor ratings obtained independently using a validated rubric (Jonsson \u0026amp; Svingby, 2007). At the end of the 10-week project cycle, the five supervising instructors rated the 47 participants\u0026apos; competency development on the same three dimensions, blind to the framework\u0026apos;s outputs. Framework scores correlated positively with instructor holistic ratings: Pearson r\u0026thinsp;=\u0026thinsp;0.71 (95% CI [0.54, 0.83]) for Critical Thinking, r\u0026thinsp;=\u0026thinsp;0.74 (95% CI [0.58, 0.85]) for Collaboration, and r\u0026thinsp;=\u0026thinsp;0.79 (95% CI [0.64, 0.88]) for Creativity. 
These correlations exceed a threshold often considered sufficient for convergent validity (r\u0026thinsp;\u0026gt;\u0026thinsp;0.60; Campbell \u0026amp; Fiske, 1959), supporting the interpretation that the composite measures reflect the constructs they are designed to capture. Internal consistency, estimated using McDonald\u0026apos;s omega, was acceptable for all three constructs: \u0026omega;\u0026thinsp;=\u0026thinsp;0.78 (CT), \u0026omega;\u0026thinsp;=\u0026thinsp;0.81 (COL), \u0026omega;\u0026thinsp;=\u0026thinsp;0.76 (CRE). These estimates are preliminary, and robust validation through multi-institutional, multi-instrument research is recommended as a focus for future studies.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\n \u003ch2\u003e3.3 Governance Mechanisms\u003c/h2\u003e\n \u003cp\u003eBias mitigation operates in two stages. During pre-processing, a Reweighing algorithm (Kamiran \u0026amp; Calders, 2012) adjusts the instance weights of the training set to balance the influence of protected and non-protected groups during model training, addressing distributional imbalances without discarding instances or changing observed labels. During in-processing, an Adversarial Debiasing component (Zhang et al., 2018) trains the predictive model jointly against a fairness adversary, penalising representations that encode group membership. This two-stage approach addresses bias at both the data level and the model level. Fairness calibration operates under two concurrent criteria. Equalized odds (Hardt et al., 2016) requires that true positive rates and false positive rates be equal across groups, relaxed here to the tolerance |TPR_a \u0026minus; TPR_b| + |FPR_a \u0026minus; FPR_b| \u0026le; \u0026epsilon;, where \u0026epsilon;\u0026thinsp;=\u0026thinsp;0.05 in the current evaluation. 
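\u003c/p\u003e
\u003cp\u003eThe equalized-odds tolerance stated above can be checked with a few lines of code. The following is a minimal sketch rather than the authors\u0026apos; implementation: it assumes that competency predictions have been binarised (for example, meets a threshold versus does not), since true and false positive rates are defined for binary outcomes, and the function names are hypothetical.\u003c/p\u003e

```python
from typing import Sequence, Tuple

EPSILON = 0.05  # tolerance reported in the text


def rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (TPR, FPR) for binary labels and binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Assumption: a rate with an empty denominator is treated as 0.0.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equalized_odds_gap(y_true_a, y_pred_a, y_true_b, y_pred_b) -> float:
    """|TPR_a - TPR_b| + |FPR_a - FPR_b| for two groups a and b."""
    tpr_a, fpr_a = rates(y_true_a, y_pred_a)
    tpr_b, fpr_b = rates(y_true_b, y_pred_b)
    return abs(tpr_a - tpr_b) + abs(fpr_a - fpr_b)


def satisfies_equalized_odds(y_true_a, y_pred_a, y_true_b, y_pred_b,
                             epsilon: float = EPSILON) -> bool:
    """True when the summed rate gap is within the tolerance epsilon."""
    return equalized_odds_gap(y_true_a, y_pred_a, y_true_b, y_pred_b) <= epsilon
```

\u003cp\u003e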
Individual fairness (Dwork et al., 2012) requires that learners who are similar in the educational features of interest receive similar predictions.\u003c/p\u003e\n \u003cdiv id=\"Sec17\" class=\"Section3\"\u003e\n \u003ch2\u003e3.3.1 Reconciliation of the Conflicting Fairness Standards\u003c/h2\u003e\n \u003cp\u003eIn about 12 percent of simulation evaluation cycles, equalized odds and individual fairness produce inconsistent adjustment signals: the change needed to increase group-level equity makes individual-level predictions less accurate, or vice versa. The reconciliation mechanism resolves such conflicts through a three-stage process. First, a context classification function C(t) classifies the type of assessment at cycle t, assigning each cycle to either a summative context (a final determination of competency that influences grades or progression) or a formative context (interim feedback with no effect on the grade). This function draws on the institutional deployment configuration supplied at system initialisation (course assessment scheduling, grade contribution weights, instructor-generated feedback cycles). Second, a conflict weight is computed as \u0026alpha;(t)\u0026thinsp;=\u0026thinsp;C(t) \u0026middot; w_G + (1\u0026thinsp;\u0026minus;\u0026thinsp;C(t)) \u0026middot; w_I, where C(t)\u0026thinsp;=\u0026thinsp;1 in summative contexts and C(t)\u0026thinsp;=\u0026thinsp;0 in formative contexts; w_G\u0026thinsp;=\u0026thinsp;0.70 is the weight for group fairness (equalized odds), and w_I\u0026thinsp;=\u0026thinsp;0.30 is the weight for individual fairness. In summative contexts, where the risk of systematic exclusion from high-stakes decisions is greatest, group fairness is weighted 0.70 and individual fairness 0.30. In formative feedback contexts, where the focus is on personalised learner-level support, individual fairness is weighted 0.70 and group fairness 0.30. 
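\u003c/p\u003e
\u003cp\u003eThe conflict weight defined above can be illustrated as follows. This is a sketch under stated assumptions, not the authors\u0026apos; code: it reads \u0026alpha;(t) as the weight applied to the group-fairness adjustment and 1\u0026thinsp;\u0026minus;\u0026thinsp;\u0026alpha;(t) as the weight applied to the individual-fairness adjustment, which reproduces the 0.70/0.30 and 0.30/0.70 splits described in the text. The linear blend and all function names are hypothetical.\u003c/p\u003e

```python
W_G = 0.70  # weight for group fairness (equalized odds), from the text
W_I = 0.30  # weight for individual fairness, from the text


def conflict_weight(is_summative: bool) -> float:
    """alpha(t) = C(t) * w_G + (1 - C(t)) * w_I, with C(t) in {0, 1}."""
    c = 1.0 if is_summative else 0.0
    return c * W_G + (1.0 - c) * W_I


def blend_adjustments(delta_group: float, delta_individual: float,
                      is_summative: bool) -> float:
    """Hypothetical linear blend of two conflicting adjustment signals.

    Assumption: alpha weights the group-level signal and (1 - alpha)
    weights the individual-level signal.
    """
    alpha = conflict_weight(is_summative)
    return alpha * delta_group + (1.0 - alpha) * delta_individual


if __name__ == "__main__":
    # Same pair of conflicting signals under the two assessment contexts.
    print(blend_adjustments(0.04, -0.02, is_summative=True))
    print(blend_adjustments(0.04, -0.02, is_summative=False))
```

Under this reading, the same pair of conflicting signals yields a larger group-driven correction in a summative cycle than in a formative one.

\u003cp\u003e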
Third, the governance calibration function g(Ŷ) applies the weighted adjustment to yield Y*, with the magnitude of the adjustment bounded by \u0026epsilon;\u0026thinsp;=\u0026thinsp;0.05 to avoid over-correction and the introduction of new inequities. Across the 12% of cycles with conflicting criteria, this reduces conflict-induced prediction variance by an average of 34% relative to unweighted arbitration.\u003c/p\u003e\n \u003cp\u003eInterpretability is provided by a hybrid mechanism that uses SHAP (Lundberg \u0026amp; Lee, 2017) for global feature attribution and LIME (Ribeiro et al., 2016) for learner-specific explanation. SHAP values are calculated over the whole population at each assessment cycle, providing system-level transparency for institutional audit. LIME produces locally faithful linear approximations of model behaviour for individual learners, which are translated into natural-language feedback by a template-based generation module using the established Explainability Evaluation Metrics framework (Hoffman et al., 2018). The hybrid approach balances the strengths of the two methods: SHAP produces globally consistent attributions, and LIME is computationally efficient at the individual level. Together, they support both population-level accountability and learner-level actionability.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e"},{"header":"4. Research Hypotheses","content":"\u003cp\u003eFour hypotheses follow from the theoretical framework. The first two address the independent contributions of AI-based assessment and governance. The third examines what their integration entails. 
The fourth addresses a structural claim that sets this study apart from previous work: that governance does not merely improve assessment quality as an independent factor but moderates the relationship between AI-based prediction and overall quality, strengthening the effect of prediction when governance is embedded.\u003c/p\u003e \u003cp\u003eH1: AI-driven intelligent assessment has a strong positive effect on the accuracy of assessing transdisciplinary competencies.\u003c/p\u003e \u003cp\u003eH2: Governance mechanisms enhance the fairness of assessment outcomes by reducing systematic bias across learner populations.\u003c/p\u003e \u003cp\u003eH3: Integrating AI-based approaches with embedded governance mechanisms positively influences the overall quality of educational assessment, compared with either in isolation.\u003c/p\u003e \u003cp\u003eH4: Governance mechanisms moderate the relationship between AI-based assessment and assessment quality, such that the positive influence of the predictive models is substantially stronger when governance is integrated into the system architecture.\u003c/p\u003e \u003cp\u003eThe hypotheses are logically cumulative. H1 and H2 state that each component contributes something distinct. H3 states that together they achieve something that neither achieves in isolation. H4 treats governance as a structural condition rather than an additive improvement, and it corresponds to the study\u0026rsquo;s most significant empirical finding.\u003c/p\u003e"},{"header":"5. Methodology","content":"\u003cp\u003eDesign Science Research (DSR) was chosen as the overarching methodological framework because the problem at the core of this study is a design problem. What is required is not a description of an existing phenomenon but the construction of an artefact, that is, a working system, and the demonstration that it performs as intended under realistic conditions. 
Traditional empirical research methods are oriented toward explaining what already exists; DSR is oriented toward building what does not yet exist and subjecting it to rigorous scrutiny.\u003c/p\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e5.1 Design Science Research Process\u003c/h2\u003e \u003cp\u003e \u003cb\u003eProblem identification.\u003c/b\u003e This work is motivated by two interrelated challenges. Traditional assessment tools lack the capacity to assess transdisciplinary skills in PBL settings. At the same time, AI-based assessment systems deployed without embedded governance reproduce existing educational inequalities at scale and in ways that are hard to detect or challenge. What is required is a system that can assess complex, continually developing competencies while remaining fair and transparent.\u003c/p\u003e \u003cp\u003e \u003cb\u003eObjective definition.\u003c/b\u003e The aim of the design is a method that increases the accuracy of assessment while ensuring that outcomes are equitable, transparent, and understandable, not by applying governance to outputs already produced but by embedding it into the processes that produce those outputs.\u003c/p\u003e \u003cp\u003e \u003cb\u003eDesign and development.\u003c/b\u003e The artefact is the multi-layer predictive governance framework proposed in Section \u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003e3\u003c/span\u003e. It brings together educational data analytics, machine learning, bias mitigation, fairness calibration, and interpretability mechanisms in a single operational system. 
Interpretability is not considered a complementary feature, but a design constraint.\u003c/p\u003e \u003cp\u003e \u003cb\u003eDemonstration and evaluation.\u003c/b\u003e The framework is illustrated and tested against two baselines: a legacy rule-based system without AI or governance (Baseline 1, preserved for historical consideration) and a contemporary AI evaluation infrastructure without embedded governance (Baseline 2, fully described in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e below). Baseline 2 is the main comparison, as it separates the contribution of governance integration from the contribution of AI capability in a wider sense.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e5.2 Agent-Based Simulation Environment\u003c/h2\u003e \u003cp\u003eThe simulation evaluation was conducted within an agent-based environment reflecting the dynamic, non-linear interactions characteristic of interdisciplinary PBL. Agent-based modelling was selected because it can generate the kinds of complex, temporally extended, multi-actor data that the framework is designed to process \u0026mdash; data that cannot be obtained from existing datasets without ethical and practical difficulties that would preclude systematic experimental variation (Bonabeau, 2002). A population of 500 synthetic learner agents was modelled, each initialised with a distinct profile of academic performance, interaction patterns, collaboration tendencies, and behavioural characteristics, drawn from distributions calibrated against real higher-education data. These profiles evolved through agent-agent and agent-environment interactions across simulation cycles, generating the non-linear developmental trajectories that the framework is specifically designed to track. 
Four parameter dimensions were systematically varied across experimental scenarios: academic performance levels (high, moderate, low); interaction and collaboration patterns (individual-dominant, collaborative, mixed); task complexity (low, moderate, high); and contextual stress conditions (low, moderate, high).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e5.3 Pilot Field Study\u003c/h2\u003e \u003cp\u003eThe pilot field study aimed to answer the complementary question that simulation does not answer: whether these performance characteristics of the framework transfer to a real educational setting where there are real students, real instructors, and real projects. The two types of evaluation are intentionally complementary: simulation facilitates systematic variation in experimentation that would be neither ethical nor practical in an authentic setting, while the field study adds ecological validity that simulation cannot.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003e5.3.1 Participants and Setting\u003c/h2\u003e \u003cp\u003eThis research was carried out in Semester 2 of the 2023\u0026ndash;2024 academic year at a research-active university with established interdisciplinary PBL programmes. Forty-seven undergraduate students enrolled in 3 interdisciplinary PBL courses participated in the study: 24 in the experimental group (assessment supported by the framework) and 23 in the control group (assessment based on a conventional rubric). Across conditions, five faculty members served as supervising instructors. Full institutional ethics approval was obtained for this study (Protocol Reference: redacted for double-blind review). 
All participants provided written informed consent.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section3\"\u003e \u003ch2\u003e5.3.2 Sample Size Justification\u003c/h2\u003e \u003cp\u003eThe sample size of 47 participants was informed by an a priori power analysis using G*Power 3.1 (Faul et al., 2007). For independent-samples t-tests (α\u0026thinsp;=\u0026thinsp;0.05, 80% power) to detect a medium effect (d\u0026thinsp;=\u0026thinsp;0.50), the analysis indicated a minimum of 44 participants, which the achieved sample (n\u0026thinsp;=\u0026thinsp;47) exceeds. The target effect size (d\u0026thinsp;=\u0026thinsp;0.50) was chosen as the minimum practically significant difference, consistent with the educational assessment literature and with effect sizes reported in similar AI-in-education intervention studies (d\u0026thinsp;=\u0026thinsp;0.40\u0026ndash;0.70; Zawacki-Richter et al., 2019). Because the sample is drawn from a single institution, the findings may not generalise; this limitation is stated explicitly in Section \u003cspan refid=\"Sec36\" class=\"InternalRef\"\u003e9\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec25\" class=\"Section3\"\u003e \u003ch2\u003e5.3.3 Procedure\u003c/h2\u003e \u003cp\u003eA pre-post design was used, with competency scores measured at the start and end of a 10-week project cycle. Experimental group instructors received guided feedback reports based on the framework, consisting of competency scores, SHAP-based feature attributions, and LIME-generated natural language explanations. Instructors in the control group used validated project rubrics (Jonsson \u0026amp; Svingby, 2007) without any AI support. Both groups were given identical project briefs and the same duration of instructor contact. A second assessor, blind to group assignment, independently assessed a random 30% subsample of project outputs. 
Inter-rater reliability was acceptable for all three competency dimensions: Critical Thinking (ICC\u0026thinsp;=\u0026thinsp;0.83; 95% CI [0.74, 0.89]), Collaboration (ICC\u0026thinsp;=\u0026thinsp;0.79; 95% CI [0.69, 0.86]), and Creativity (ICC\u0026thinsp;=\u0026thinsp;0.76; 95% CI [0.65, 0.84]).\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section2\"\u003e \u003ch2\u003e5.4 Evaluation Metrics and Baseline Specifications\u003c/h2\u003e \u003cp\u003e Performance was assessed using five metrics that together cover the technical and ethical aspects of assessment quality. Accuracy measures the agreement between predicted and actual competency levels. Fairness measures equality in prediction errors across groups, computed as the complement of group-level disparity. Reliability captures the consistency of outputs across simulation iterations. Response time measures the average processing time per assessment iteration. Explainability is measured by SHAP attribution coverage and instructor-rated interpretability on the validated 7-point Explainability Evaluation Metrics scale (Hoffman et al., 2018). 
Overall assessment quality is defined as the weighted composite Q\u0026thinsp;=\u0026thinsp;w₁\u0026middot;A\u0026thinsp;+\u0026thinsp;w₂\u0026middot;F\u0026thinsp;+\u0026thinsp;w₃\u0026middot;R\u0026thinsp;+\u0026thinsp;w₄\u0026middot;T, weighting w₁ = 0.30 (Accuracy), w₂ = 0.30 (Fairness), w₃ = 0.25 (Reliability/Interpretability), w₄ = 0.15 (Response Time).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cem\u003eTechnical Specifications for Baseline 2 (GPT-4 Standard \u0026mdash; No Embedded Governance)\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eParameter\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpecification\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel architecture\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-4 (OpenAI) via API \u0026mdash; standard configuration, no fine-tuning\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrompt structure\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eZero-shot assessment prompts; competency rubric in system message; interaction log excerpts as user input\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eInput data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIdentical multi-source feature set: academic records, interaction logs, behavioural indicators, collaborative activity data\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOutput format\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNumerical competency score (0\u0026ndash;1) per dimension with brief natural-language rationale; no SHAP/LIME attribution\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBias mitigation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNone \u0026mdash; no Reweighing, no Adversarial Debiasing\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFairness constraints\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNone \u0026mdash; no equalized odds enforcement; no individual fairness calibration\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGovernance integration\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePost-hoc human instructor review only\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTraining data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-4 pre-training corpus (OpenAI, 2023); no domain-specific fine-tuning\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEvaluation conditions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIdentical: same 
500-agent population, same feature set, same five metrics, same 50 replications\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eNote. Baseline 2 was selected as the primary comparison because it represents the current state of practice in generative AI-based educational assessment deployed without embedded governance. The deliberate absence of bias mitigation and fairness constraints in Baseline 2 isolates the governance contribution of the proposed framework.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eTo further isolate the contribution of embedded governance mechanisms, an additional baseline was introduced using ensemble machine learning models without governance integration. This baseline employs the same Random Forest and Gradient Boosting algorithms used in the proposed framework but excludes all governance components, including bias mitigation, fairness calibration, and explainability mechanisms. 
This comparison enables a more rigorous evaluation of whether performance improvements are attributable to the governance layer rather than the predictive modeling itself.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cem\u003eTechnical Specifications for Baseline 3 (Ensemble ML without Governance)\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eParameter\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpecification\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel architecture\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRandom Forest\u0026thinsp;+\u0026thinsp;Gradient Boosting (ensemble)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInput data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIdentical multi-source feature set\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOutput format\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNumerical competency scores (0\u0026ndash;1)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eBias mitigation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNone\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFairness constraints\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNone\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGovernance integration\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNot included\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInterpretability\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNot included (no SHAP/LIME)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTraining\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eStandard supervised learning\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEvaluation conditions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIdentical to proposed framework\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section2\"\u003e \u003ch2\u003e5.5 Threats to Validity\u003c/h2\u003e \u003cp\u003eSeveral limitations should be considered when interpreting the results of this study. First, the pilot field study was conducted with a relatively small sample size (n\u0026thinsp;=\u0026thinsp;47) from a single institutional context, which may limit the generalisability of the findings. 
Future research should validate the framework across multiple institutions and larger, more diverse populations.\u003c/p\u003e \u003cp\u003eSecond, the simulation environment, while designed to approximate realistic PBL dynamics, cannot fully capture the complexity of real-world educational settings. Although agent-based modelling enables controlled experimentation, it remains an abstraction of actual learner behaviour.\u003c/p\u003e \u003cp\u003eThird, the operationalisation of transdisciplinary competencies relies on proxy indicators derived from interaction data, which may not fully represent the depth of cognitive and collaborative processes. While convergent validity was established through instructor ratings, further validation using multiple assessment instruments is recommended.\u003c/p\u003e \u003cp\u003eFinally, the proposed governance mechanisms are implemented within a specific set of fairness criteria and thresholds. Alternative definitions of fairness or different parameter settings may lead to different outcomes, suggesting the need for context-sensitive calibration in future implementations.\u003c/p\u003e \u003c/div\u003e"},{"header":"6. Experimental Evaluation and Results","content":"\u003cdiv id=\"Sec29\" class=\"Section2\"\u003e \u003ch2\u003e6.1 Primary Quantitative Results\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents the performance comparison between the proposed framework and Baseline 2 across five evaluation metrics. 
Values represent means across 50 simulation replications; all between-condition differences are statistically significant at p\u0026thinsp;\u0026lt;\u0026thinsp;0.05 or better.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cem\u003ePerformance Comparison: Proposed Framework vs. Baseline 2 (GPT-4 Standard, No Embedded Governance)\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMetric\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBaseline 2 (GPT-4)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProposed Framework\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eImprovement\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.75\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.92\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e+\u0026thinsp;22.7%\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFairness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e+\u0026thinsp;46.7%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eReliability\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e+\u0026thinsp;28.6%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eResponse Time\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.65\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e+\u0026thinsp;30.8%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eExplainability\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.55\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e+\u0026thinsp;65.5%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eNote. 
All differences statistically significant at p\u0026thinsp;\u0026lt;\u0026thinsp;0.05; accuracy, fairness, and explainability at p\u0026thinsp;\u0026lt;\u0026thinsp;0.01. Values are means across 50 simulation replications. Cohen's d range: 0.71\u0026ndash;1.12. 95% CIs are reported in the supplementary materials.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec30\" class=\"Section2\"\u003e \u003ch2\u003e6.2 Results Analysis\u003c/h2\u003e \u003cp\u003eThe accuracy improvement, from 0.75 to 0.92 (+\u0026thinsp;22.7%), indicates that the proposed model identifies complex, non-linear patterns in learner performance data that Baseline 2, operating without the discipline imposed by embedded governance, does not capture as consistently. This improvement shows that integrating governance does not come at the expense of predictive performance, a point that becomes particularly important when paired with the fairness results. The fairness improvement is the second-largest gain, from 0.60 to 0.88 (+\u0026thinsp;46.7%). Baseline 2 yields significantly inequitable outcomes across learner subgroups, not because of intentional design but as an expected consequence of training on historical educational data that already reflects existing inequalities. The governance layer addresses this by applying its constraints at the modelling stage, where predictions are generated, rather than filtering outputs afterwards. The 46.7% increase reflects what changing the process, rather than auditing the outputs, achieves. This result supports H2. The improvements in reliability (+\u0026thinsp;28.6%) and response time (+\u0026thinsp;30.8%) indicate that governance integration neither destabilises the assessment system nor slows it down. 
The consistency improvements across simulation iterations show that governance mechanisms impose a constructive type of discipline on the predictive models, lowering variance in outputs. This matters in practical terms: it is hard to trust a system to make consequential individual decisions when it produces different results under similar conditions. The explainability improvement is the largest of the five, in both absolute and relative terms (from 0.55 to 0.91, +\u0026thinsp;65.5%). The SHAP-LIME hybrid mechanism significantly increases the proportion of prediction variance attributable to identifiable features and generates natural-language explanations that instructors in the field study characterized as interpretable and actionable, as measured with the Hoffman et al. (2018) validated scale. An assessment output that cannot be explained to an educator who understands the learning context cannot be meaningfully contested, and an output that cannot be contested cannot be trusted. Taken together, these results support H3.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec31\" class=\"Section2\"\u003e \u003ch2\u003e6.3 Moderation Analysis: Testing H4\u003c/h2\u003e \u003cp\u003eThe moderation hypothesis (H4), that governance acts as a structural moderating variable rather than an additive one, was examined by comparing the association between AI-facilitated predictive performance and overall assessment quality in the presence and absence of embedded governance. Without governance (Baseline 2), increasing predictive accuracy does not lead to a proportional increase in assessment quality, because no constraints exist on fairness and reliability. Where governance is embedded, increasing predictive accuracy results more consistently in improvements in overall quality. 
Moderation analysis confirmed that the governance condition significantly strengthened the accuracy-to-quality relationship (β\u0026thinsp;=\u0026thinsp;0.47, p \u0026lt; .001). Governance is not just an additive enhancement; it is a structural condition that alters how the predictive component performs.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec32\" class=\"Section2\"\u003e \u003ch2\u003e6.4 Longitudinal Performance\u003c/h2\u003e \u003cp\u003ePerformance was measured over five consecutive testing cycles, at Weeks 2, 4, 6, 8 and 10 of the simulated project cycle. The proposed framework shows a pattern of iterative improvement: fairness increases from 0.71 (Cycle 1) to 0.89 (Cycle 5), with the greatest gains between Cycles 1 and 3, as the governance calibration mechanism accumulates enough data to refine its bias-mitigation decisions. Accuracy stabilises at 0.92 by Cycle 4. Explainability exhibits the steepest initial gain, from 0.72 to 0.88 between Cycles 1 and 2, as rapidly accumulating interaction data allows the SHAP attribution model to be updated continuously. Baseline 2 shows only a marginal increase across the cycles (mean gain per cycle\u0026thinsp;=\u0026thinsp;0.008 vs. 0.036 for the proposed framework), indicative of the lack of a feedback-adaptive governance mechanism.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec33\" class=\"Section2\"\u003e \u003ch2\u003e6.5 Pilot Field Study Results\u003c/h2\u003e \u003cp\u003e001, d\u0026thinsp;=\u0026thinsp;0.68); Fairness, measured as cross-group score disparity reduction (t(45)\u0026thinsp;=\u0026thinsp;5.61, p \u0026lt; .001, d\u0026thinsp;=\u0026thinsp;0.91); Reliability (t(45)\u0026thinsp;=\u0026thinsp;3.87, p = .001, d\u0026thinsp;=\u0026thinsp;0.74); and Interpretability as rated by instructors on the Hoffman et al. (2018) validated 7-point scale (t(45)\u0026thinsp;=\u0026thinsp;6.14, p \u0026lt; .001, d\u0026thinsp;=\u0026thinsp;0.89). 
Effect sizes (d range: 0.68\u0026ndash;0.91) are consistent with simulation findings, suggesting that the performance characteristics observed under controlled conditions reflect properties of the architecture that transfer to authentic settings. These results are interpreted with appropriate caution given the sample size and single-institution design.\u003c/p\u003e \u003c/div\u003e"},{"header":"7. Discussion","content":"\u003cp\u003eThe results are consistent and statistically robust across both simulation and field contexts. This consistency matters: a system that works well only under favourable circumstances is not one that can be relied on under the variable and unpredictable conditions of authentic educational environments. The framework\u0026rsquo;s gains are largest in fairness and explainability, precisely where embedded governance is expected to add the most value, which suggests that the gains stem from the governance layer and not merely from improved predictive modelling. The fairness finding warrants special attention. Baseline 2, a technically sound and widely used AI methodology, produces systematically less equitable outcomes across learner subgroups. This is not a design flaw in any narrow sense; rather, it is the natural outcome of training a system on historical educational data with no means to detect and adjust for patterns of inequality contained in the data. The governance layer does not achieve its improvement by filtering the outputs of an unchanged model. It alters the mechanism by which predictions are created: which features the model depends on, how it handles group-correlated patterns, and how it responds when bias-mitigation and fairness signals conflict. The 46.7% improvement in fairness quantifies the difference between changing the process and auditing the output. 
These findings also have implications for the relationship between fairness and accuracy. It is widely believed, partly on the strength of theoretical impossibility results in the algorithmic fairness literature (Chouldechova, 2017; Kleinberg et al., 2016), that improving fairness requires sacrificing some accuracy. These results put pressure on that assumption, or at least qualify it. The impossibility results are typically discussed in relation to fairness interventions applied to fixed models; they do not necessarily apply when governance is built into the training process itself. When Reweighing and Adversarial Debiasing are active during training, the model learns representations that are, at least in part, less sensitive to group-correlated spurious features, thus reducing bias while optimizing predictive accuracy on the features that are educationally relevant. On this reading, the fairness-accuracy tradeoff may be at least partly an artefact of how AI systems are typically designed rather than an inherent property of the prediction problem itself. The explainability results have both practical and theoretical significance. The SHAP-LIME hybrid provides explanations that instructors rated as meaningful and usable rather than merely accepted as given. An instructor who knows why a student received a competency assessment can judge whether that reasoning matches their own understanding of the student's learning, challenge it if it does not, and use it to inform instructional choices if it does. This kind of human-AI interaction positions the AI system as a collaborator in assessment rather than an authority, in contrast to a black-box output that can only be accepted, rejected or ignored. 
The response time and reliability results address the practical concern that governance integration will impose computational overhead and introduce variability. Neither concern is supported. The governance mechanisms, which are integrated with prediction rather than applied sequentially afterwards, did not introduce the processing delays that a layered architecture would cause. The discipline imparted by constant fairness constraints also appears to reduce rather than increase output variability, an intuitive finding given that bias in predictions is itself a source of inconsistency. A methodological comment on the use of agent-based simulation is warranted. The choice was intentional: simulation allows systematic variation of experimental conditions in a way that would be ethically and practically impossible in real-world educational contexts at this stage of the framework's development. The 500-agent population, calibrated against real distributions of higher education data, provides the kind of complex, longitudinal, non-linear data the framework is designed to accommodate. The consistency between the simulation effect sizes and the field study findings, itself an empirically meaningful result, suggests that the simulation captures something authentic about the framework architecture rather than merely reproducing its own calibration parameters.\u003c/p\u003e"},{"header":"8. Implications","content":"\u003cp\u003e \u003cb\u003ePractical implications.\u003c/b\u003e Institutions that want to implement the framework can follow four stages. 
In Stage 1 (Infrastructure Assessment, 1 to 2 months), institutions audit their digital learning environment to determine whether it generates the four data streams that the framework\u0026rsquo;s Data Layer requires: academic records, interaction logs, behavioural indicators, and collaborative activity data. Institutions that have interaction logging enabled on LMS platforms (Canvas, Moodle, Blackboard) will usually have sufficient infrastructure; those operating predominantly in face-to-face or low-tech settings need to address this limitation before proceeding. In Stage 2 (Governance Parameter Configuration, about 1 month), assessment coordinators configure the context classification function, specifying the assessment calendar, grading contribution weights, and formative/summative designation for each assessment event. In Stage 3 (Instructor Orientation, 2\u0026ndash;3 classes), instructors receive guided training in reading framework output reports, specifically in interpreting SHAP attribution charts and natural-language explanations (e.g. LIME) in relation to the competency constructs being evaluated. In practice, the interpretability improvements reported in Section \u003cspan refid=\"Sec33\" class=\"InternalRef\"\u003e6.5\u003c/span\u003e are realised only when instructors are prepared to apply them to their instructional decisions. In Stage 4 (Monitored Deployment, one full project cycle), the framework is run in parallel with existing assessment modes, and its outcomes are reviewed by instructors exercising human judgement prior to full incorporation. 
Full deployment without human oversight is not recommended at this stage of the framework\u0026rsquo;s development.\u003c/p\u003e \u003cp\u003e \u003cb\u003ePolicy implications.\u003c/b\u003e The difference between Baseline 2 (0.60 fairness) and the proposed model (0.88 fairness) is not a theoretical risk; it is a measurable consequence of using AI assessment without embedded governance, documented here in a controlled environment. Policymakers establishing standards for AI in education that reflect the international frameworks of UNESCO (2023) and the OECD (2021) should recognise embedded governance as a required precondition for responsible deployment rather than an addendum. External regulation applied only after deployment is inadequate on its own: it may discover problems that have already produced unfair results, but it cannot prevent them from happening.\u003c/p\u003e \u003cp\u003e \u003cb\u003eEducational implications.\u003c/b\u003e By facilitating a shift from outcome-based to process-based assessment, the framework has direct implications for how transdisciplinary competencies are conceptualized and assessed. Competencies that develop non-linearly and in context cannot be adequately captured by static knowledge tests; the framework offers an operational alternative. Institutions will have to invest in developing educators\u0026rsquo; ability to interpret and critically engage with AI-generated assessment insights, a professional development demand that the framework\u0026rsquo;s interpretability gains make far more tractable than opaque AI systems do.\u003c/p\u003e \u003cp\u003e \u003cb\u003eTheoretical implications\u003c/b\u003e. The reframing of the relationship between AI prediction and governance is one of the main theoretical contributions of the study. 
Governance is not something applied to AI assessment systems from outside; when properly embedded, it is a condition of how they operate. This redirects the design question from how to trade off accuracy and ethical design to how to design systems in which these two properties interact and reinforce each other. The contention that embedded governance enhances fairness and accuracy at the same time, and that the trade-off normally presumed between them is an artefact of governance-as-afterthought, is the most significant theoretical result of the study.\u003c/p\u003e"},{"header":"9. Conclusion","content":"\u003cp\u003eThis research seeks to fill a documented but underexplored gap in the literature on AI-driven educational assessment: no existing framework has combined AI-based prediction, embedded governance and the distinctive challenges of assessing transdisciplinary competence in PBL contexts in a unified design. The proposed predictive governance model shows that this integration is technically possible and that it yields tangible performance gains in all five domains (accuracy, fairness, reliability, efficiency, and explainability) that determine whether an assessment system can be trusted in complex learning environments. The most important finding of the research concerns the relationship between fairness and accuracy. Contrary to the common premise in the algorithmic fairness literature that these properties trade off against each other, the evidence here suggests that they reinforce each other when governance is embedded in the training process rather than applied to outputs afterwards. 
This implies that the commonly observed fairness-accuracy tension may be partly an artefact of how AI assessment systems are typically designed rather than an intrinsic property of the prediction problem. If correct, this has real implications for how the field understands the design space of responsible AI assessment. These conclusions are qualified by several limitations. Because the main evaluation relies on agent-based simulation, and the pilot field study provides only preliminary evidence of ecological validity, the results need to be examined at larger scale across different institutions and educational settings before strong generalisations can be made. The pilot sample (n\u0026thinsp;=\u0026thinsp;47, single institution, 10-week period) is too small to support generalisation across institutional cultures, disciplinary contexts or student populations, and the findings may hold only in a limited range of settings. The framework also depends on data from digital interaction logs, which may not adequately reflect learning that occurs in face-to-face or low-technology project contexts, a limitation that institutions should explicitly evaluate prior to implementation. There are four avenues for future research: multi-institutional validation studies using larger and more representative samples; expanding the input variable set with learning evidence gathered from non-digital contexts; measuring the moderating effect of institutional culture and disciplinary norms on the success of governance-integrated assessment systems; and assessing the extent to which the performance benefits of the reconciliation mechanism are replicable across institutional deployment conditions that differ significantly from those used in this study. 
The last is key: context classification is trained on institutional settings specified at initialisation, and its generalisation to markedly different institutional settings remains an open empirical question. What this study ultimately shows is that the decision confronting educational institutions is not a choice between AI capability and ethical responsibility, between systems that are accurate and systems that are fair and explainable. It is a design challenge, and one for which this framework provides a deliberately constructed, testable answer.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eA.H. conceptualized the study, developed the methodology, designed the model, and wrote the original draft of the manuscript. N.S. conducted data analysis, contributed to validation, and participated in writing, reviewing, and editing the manuscript. All authors reviewed and approved the final version of the manuscript.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eData Availability Statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBoth the simulation datasets and the agent-based modelling code produced in this work are available from the corresponding author upon reasonable request and will be deposited into an open-access repository upon acceptance. Data from the pilot field study are not publicly available because of institutional ethics approval conditions and participant privacy protections; de-identified summary statistics may be made available to qualified researchers through a formal data sharing agreement.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthical approval and consent to participate:\u0026nbsp;\u003c/strong\u003eThe pilot field study received full institutional ethics approval (Protocol Reference: redacted for double-blind review). 
All participants provided written informed consent and were explicitly informed of their right to withdraw at any time without academic consequence. \u003cstrong\u003eUse of AI writing tools:\u003c/strong\u003e AI-assisted language tools were used for grammar checking and reference formatting in preparing this manuscript. All intellectual content, analytical decisions, theoretical arguments, and empirical interpretations are the authors\u0026apos; own. The authors take full responsibility for the integrity and accuracy of all reported work.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests:\u003c/strong\u003e The authors declare no competing interests, financial or otherwise, related to the work reported in this manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;Funding:\u003c/strong\u003e [Funding information redacted for double-blind review]\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eBaker, R., \u0026amp; Hawn, A. (2022). Algorithmic bias in education. Educational Researcher, 51(4), 259\u0026ndash;268. https://doi.org/10.3102/0013189X211059733\u003c/li\u003e\n\u003cli\u003eBarocas, S., Hardt, M., \u0026amp; Narayanan, A. (2019). Fairness and machine learning: Limitations and opportunities. MIT Press. https://fairmlbook.org\u003c/li\u003e\n\u003cli\u003eBaskerville, R., Baiyere, A., Gregor, S., Hevner, A., \u0026amp; Rossi, M. (2018). Design science research contributions: Finding a balance between artifact and theory. Journal of the Association for Information Systems, 19(5), 358\u0026ndash;376. https://doi.org/10.17705/1jais.00495\u003c/li\u003e\n\u003cli\u003eBonabeau, E. (2002). Agent-based modeling: Methods and techniques for simulating human systems. Proceedings of the National Academy of Sciences, 99(Suppl. 3), 7280\u0026ndash;7287. https://doi.org/10.1073/pnas.082080899\u003c/li\u003e\n\u003cli\u003eCampbell, D. T., \u0026amp; Fiske, D. W. (1959). 
Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81\u0026ndash;105. https://doi.org/10.1037/h0046016\u003c/li\u003e\n\u003cli\u003eChouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153\u0026ndash;163. https://doi.org/10.1089/big.2016.0047\u003c/li\u003e\n\u003cli\u003eDwivedi, Y. K., et al. (2023). So what if ChatGPT wrote it? Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management, 71, 102642. https://doi.org/10.1016/j.ijinfomgt.2023.102642\u003c/li\u003e\n\u003cli\u003eDwork, C., Hardt, M., Pitassi, T., Reingold, O., \u0026amp; Zemel, R. (2012). Fairness through awareness. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (pp. 214\u0026ndash;226). ACM. https://doi.org/10.1145/2090236.2090255\u003c/li\u003e\n\u003cli\u003eFaul, F., Erdfelder, E., Lang, A.-G., \u0026amp; Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175\u0026ndash;191. https://doi.org/10.3758/BF03193146\u003c/li\u003e\n\u003cli\u003eFloridi, L., et al. (2018). AI4People \u0026mdash; An ethical framework for a good AI society. Minds and Machines, 28(4), 689\u0026ndash;707. https://doi.org/10.1007/s11023-018-9482-5\u003c/li\u003e\n\u003cli\u003eGa\u0026scaron;ević, D., Dawson, S., \u0026amp; Siemens, G. (2015). Let\u0026apos;s not forget: Learning analytics are about learning. TechTrends, 59(1), 64\u0026ndash;71. https://doi.org/10.1007/s11528-014-0822-x\u003c/li\u003e\n\u003cli\u003eGregor, S., \u0026amp; Hevner, A. R. (2013). Positioning and presenting design science research for maximum impact. MIS Quarterly, 37(2), 337\u0026ndash;355. 
https://doi.org/10.25300/MISQ/2013/37.2.01\u003c/li\u003e\n\u003cli\u003eHardt, M., Price, E., \u0026amp; Srebro, N. (2016). Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 3315\u0026ndash;3323.\u003c/li\u003e\n\u003cli\u003eHasson, F., Keeney, S., \u0026amp; McKenna, H. (2000). Research guidelines for the Delphi survey technique. Journal of Advanced Nursing, 32(4), 1008\u0026ndash;1015. https://doi.org/10.1046/j.1365-2648.2000.t01-1-01567.x\u003c/li\u003e\n\u003cli\u003eHevner, A. R., March, S. T., Park, J., \u0026amp; Ram, S. (2004). Design science in information systems research. MIS Quarterly, 28(1), 75\u0026ndash;105. https://doi.org/10.2307/25148625\u003c/li\u003e\n\u003cli\u003eHoffman, R. R., Mueller, S. T., Klein, G., \u0026amp; Litman, J. (2018). Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608.\u003c/li\u003e\n\u003cli\u003eHolmes, W., Bialik, M., \u0026amp; Fadel, C. (2019). Artificial intelligence in education: Promises and implications for teaching and learning. Center for Curriculum Redesign.\u003c/li\u003e\n\u003cli\u003eIfenthaler, D., \u0026amp; Yau, J. Y.-K. (2020). Utilising learning analytics to support study success in higher education: A systematic review. Educational Technology Research and Development, 68(4), 1961\u0026ndash;1990. https://doi.org/10.1007/s11423-020-09788-z\u003c/li\u003e\n\u003cli\u003eJonsson, A., \u0026amp; Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130\u0026ndash;144. https://doi.org/10.1016/j.edurev.2007.05.002\u003c/li\u003e\n\u003cli\u003eKamiran, F., \u0026amp; Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1), 1\u0026ndash;33. https://doi.org/10.1007/s10115-011-0463-8\u003c/li\u003e\n\u003cli\u003eKleinberg, J., Mullainathan, S., \u0026amp; Raghavan, M. 
(2016). Inherent trade-offs in the fair determination of risk scores. Proceedings of Innovations in Theoretical Computer Science. https://doi.org/10.4230/LIPIcs.ITCS.2017.43\u003c/li\u003e\n\u003cli\u003eKokotsaki, D., Menzies, V., \u0026amp; Wiggins, A. (2016). Project-based learning: A review of the literature. Improving Schools, 19(3), 267\u0026ndash;277. https://doi.org/10.1177/1365480216659733\u003c/li\u003e\n\u003cli\u003eKovanović, V., Ga\u0026scaron;ević, D., Joksimović, S., Hatala, M., \u0026amp; Siemens, G. (2021). What is learning analytics? An integrative definition and conceptual framework. Journal of Learning Analytics, 8(1), 1\u0026ndash;5. https://doi.org/10.18608/jla.2021.7140\u003c/li\u003e\n\u003cli\u003eLuckin, R., Holmes, W., Griffiths, M., \u0026amp; Forcier, L. B. (2016). Intelligence unleashed: An argument for AI in education. Pearson.\u003c/li\u003e\n\u003cli\u003eLundberg, S. M., \u0026amp; Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 4765\u0026ndash;4774.\u003c/li\u003e\n\u003cli\u003eMitchell, S., Potash, E., Barocas, S., D\u0026apos;Amour, A., \u0026amp; Lum, K. (2021). Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application, 8, 141\u0026ndash;163. https://doi.org/10.1146/annurev-statistics-042720-125902\u003c/li\u003e\n\u003cli\u003eOrganisation for Economic Co-operation and Development. (2021). Recommendation of the Council on Artificial Intelligence. OECD Publishing. https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449\u003c/li\u003e\n\u003cli\u003ePeffers, K., Tuunanen, T., Rothenberger, M. A., \u0026amp; Chatterjee, S. (2007). A design science research methodology for information systems research. Journal of Management Information Systems, 24(3), 45\u0026ndash;77. https://doi.org/10.2753/MIS0742-1222240302\u003c/li\u003e\n\u003cli\u003ePerignat, E., \u0026amp; Katz-Buonincontro, J. 
(2019). STEAM in practice and research: An integrative literature review. Thinking Skills and Creativity, 31, 31\u0026ndash;43. https://doi.org/10.1016/j.tsc.2018.10.002\u003c/li\u003e\n\u003cli\u003eReimers, N., \u0026amp; Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/D19-1410\u003c/li\u003e\n\u003cli\u003eRibeiro, M. T., Singh, S., \u0026amp; Guestrin, C. (2016). \u0026apos;Why should I trust you?\u0026apos;: Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135\u0026ndash;1144). https://doi.org/10.1145/2939672.2939778\u003c/li\u003e\n\u003cli\u003eUNESCO. (2023). Guidance for generative AI in education and research. UNESCO Publishing. https://doi.org/10.54675/PCSP7350\u003c/li\u003e\n\u003cli\u003evan Gelder, T. (2005). Teaching critical thinking: Some lessons from cognitive science. College Teaching, 53(1), 41\u0026ndash;48. https://doi.org/10.3200/CTCH.53.1.41-48\u003c/li\u003e\n\u003cli\u003eViberg, O., Hatakka, M., B\u0026auml;lter, O., \u0026amp; Mavroudi, A. (2018). The current landscape of learning analytics in higher education. Computers in Human Behavior, 89, 98\u0026ndash;110. https://doi.org/10.1016/j.chb.2018.07.027\u003c/li\u003e\n\u003cli\u003eVoogt, J., \u0026amp; Roblin, N. P. (2012). A comparative analysis of international frameworks for 21st century competences. Journal of Curriculum Studies, 44(3), 299\u0026ndash;321. https://doi.org/10.1080/00220272.2012.668938\u003c/li\u003e\n\u003cli\u003eWasserman, S., \u0026amp; Faust, K. (1994). Social network analysis: Methods and applications. Cambridge University Press.\u003c/li\u003e\n\u003cli\u003eWorld Economic Forum. (2024). The future of jobs report 2024. World Economic Forum. 
https://www.weforum.org/publications/the-future-of-jobs-report-2024/\u003c/li\u003e\n\u003cli\u003eZawacki-Richter, O., Mar\u0026iacute;n, V. I., Bond, M., \u0026amp; Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education \u0026mdash; where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39. https://doi.org/10.1186/s41239-019-0171-0\u003c/li\u003e\n\u003cli\u003eZhang, B. H., Lemoine, B., \u0026amp; Mitchell, M. (2018). Mitigating unwanted biases with adversarial learning. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (pp. 335\u0026ndash;340). https://doi.org/10.1145/3278721.3278779\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"intelligent assessment, AI governance, transdisciplinary competencies, project-based learning, predictive modelling, algorithmic fairness, bias mitigation","lastPublishedDoi":"10.21203/rs.3.rs-9401633/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9401633/v1","license":{"name":"CC BY 
4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe assessment of transdisciplinary competencies\u0026mdash;such as critical thinking, collaboration, and creativity\u0026mdash;remains a major challenge in higher education, particularly within project-based learning (PBL) environments where learning is complex, dynamic, and context-dependent. Traditional assessment approaches are limited in capturing such competencies due to their reliance on static and isolated measurement methods.\u003c/p\u003e \u003cp\u003eThis study proposes an AI-driven intelligent assessment system that integrates machine learning techniques with learning analytics to evaluate students\u0026rsquo; competencies using multi-source educational data, including academic performance, interaction logs, and collaborative activity indicators. The system employs ensemble models (Random Forest and Gradient Boosting) to generate predictive assessments, supported by explainability mechanisms (SHAP and LIME) to enhance interpretability and instructional usability.\u003c/p\u003e \u003cp\u003eTo ensure responsible and equitable assessment, the system incorporates embedded mechanisms for bias mitigation and fairness calibration during the prediction process. The proposed framework was evaluated using a dual-method approach: (1) an agent-based simulation involving 500 synthetic learners to examine system performance under controlled conditions, and (2) a pilot empirical study with 47 undergraduate students enrolled in interdisciplinary PBL courses.\u003c/p\u003e \u003cp\u003eThe results demonstrate significant improvements over a baseline AI model across five key evaluation metrics: accuracy (+\u0026thinsp;22.7%), fairness (+\u0026thinsp;46.7%), reliability (+\u0026thinsp;28.6%), processing efficiency (+\u0026thinsp;30.8%), and explainability (+\u0026thinsp;65.5%). 
Empirical findings further confirm the system\u0026rsquo;s effectiveness, with medium-to-large effect sizes (d\u0026thinsp;=\u0026thinsp;0.68\u0026ndash;0.91) across all measured dimensions.\u003c/p\u003e \u003cp\u003eThe study contributes to the field of educational technology by presenting a scalable and interpretable AI-based assessment system that enhances both the quality and fairness of evaluation in complex learning environments. The findings highlight the potential of integrating AI-driven analytics with responsible design principles to support data-informed educational decision-making.\u003c/p\u003e","manuscriptTitle":"AI-Driven Intelligent Assessment System for Evaluating Transdisciplinary Competencies in Project-Based Learning: An Empirical and Simulation Study","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-16 08:41:52","doi":"10.21203/rs.3.rs-9401633/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"03414f70-a9be-4ffe-b791-97350bc9ca4c","owner":[],"postedDate":"April 16th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-04-21T07:12:17+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-16 
08:41:52","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9401633","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9401633","identity":"rs-9401633","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

