Artificial Intelligence and Law, 2011–2026: A Systematic Scoping Review of Methods, Benchmarks, and Open Challenges

Systematic Review

Pradeep Kumar

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8913025/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

This systematic scoping review examines how research at the intersection of Artificial Intelligence and Law (AI & Law) has evolved over the fifteen-year period from 2011 to 2026.
Following a PRISMA-ScR-informed protocol, we synthesise contributions published primarily in Artificial Intelligence and Law and related venues across two converging paradigms: (i) symbolic, argumentation-based, and formal models for legal knowledge representation, normative reasoning, and justification, and (ii) statistical, machine learning, and natural language processing (NLP) approaches that analyse, predict, and retrieve legal text at scale. Our core finding is that the field has transitioned from a dichotomy of 'AI or Law' toward hybrid socio-technical systems in which formal guarantees—normative consistency, traceability, and human oversight—must coexist with empirical performance demands such as robust generalisation, reproducibility, and realistic task evaluation. Methodologically, a clear shift from relatively closed, domain-specific systems toward open benchmarks, open data, and open implementations is observable, particularly in legal NLP and legal information retrieval/entailment competitions. Yet a crucial distinction persists: the difference between 'predicting correctly' and 'reasoning legally.' Multiple contributions emphasise that predictive models without adequate explanation and justification frameworks remain legally and socially problematic. We operationalise a triadic taxonomy—text-centric, reasoning-centric, and governance-centric—and map representative works onto method families (symbolic, statistical, hybrid), datasets and benchmarks, and application domains (contract analysis, e-discovery, compliance checking, adjudication support, and argument mining). The EU AI Act's risk-based framework, with phased applicability through 2026–2027, directly amplifies research questions around transparency, documentation, human oversight, and data quality. 
We conclude with a concrete research agenda identifying five open challenges: justification-oriented benchmark design, hybrid LLM-plus-formal-constraint architectures, multilingual and cross-jurisdiction transfer, human-centred evaluation protocols, and open-texture detection in regulatory text.

Keywords: Artificial Intelligence and Law; legal NLP; normative reasoning; explainable AI; legal benchmarks; EU AI Act; argument mining; transformer models; compliance checking; scoping review

1. Introduction

The relationship between Artificial Intelligence and Law is one of the oldest sustained interdisciplinary research programmes in computer science. Dating back to Sergot et al.'s formalisation of the British Nationality Act as a logic program in 1986, and institutionalised through the ICAIL conference series beginning in 1987 and JURIX in 1988, the field has consistently grappled with a deceptively simple question: to what extent can computational methods usefully model, support, or partially automate legal reasoning and legal practice? Over the fifteen years from 2011 to 2026, the contours of that question have shifted dramatically. In 2011, the dominant paradigm remained largely symbolic: researchers designed rule-based systems, argumentation frameworks, case-based reasoners, and formal ontologies to capture the structure of legal norms, precedents, and interpretive arguments. By 2016, deep learning had begun reshaping adjacent fields such as natural language processing, and by 2019 transformer-based language models (pre-trained on large corpora and fine-tuned for downstream tasks) had become the default experimental setting for virtually every text-classification, retrieval, and question-answering task, including legal ones. The period under review thus spans a profound methodological transition, but one that has not simply displaced the earlier tradition.
Instead, the field has come to recognise that legal applications impose requirements—explicit justification, normative correctness, auditability, procedural fairness—that purely statistical models cannot easily satisfy. The result is a growing body of hybrid work that combines the semantic power of large language models with the inferential precision of formal reasoners, knowledge graphs, and argumentation structures. This transition is further complicated by a rapidly evolving regulatory context. The European Union's Artificial Intelligence Act (Regulation (EU) 2024/1689), which entered into force on 1 August 2024, imposes a risk-based framework with tiered obligations for AI systems deployed in high-risk domains, including the administration of justice. Its phased applicability—prohibitions and AI literacy requirements from February 2025, general-purpose AI model obligations from August 2025, and full applicability of core rules by August 2026—means that legal researchers are simultaneously building the analytical tools needed to understand AI and becoming subject to the regulatory frameworks that govern it. Against this backdrop, the present review pursues four interrelated objectives. First, it maps the thematic and methodological landscape of AI & Law research between 2011 and 2026, drawing primarily on publications in the Springer journal Artificial Intelligence and Law supplemented by major proceedings (ICAIL, JURIX) and benchmark-defining work from NLP venues. Second, it evaluates the methodological families—symbolic, statistical, and hybrid—against the specific demands that legal applications place on AI systems. Third, it analyses the benchmark and dataset infrastructure that has emerged, with attention to evaluation pitfalls that are specific to the legal domain. Fourth, it proposes a concrete research agenda oriented toward the regulatory, technical, and socio-legal challenges ahead. The remainder of this paper is structured as follows. 
Section 2 describes the review protocol, including search strategy, inclusion and exclusion criteria, and quality assessment. Section 3 presents the triadic taxonomy that structures our thematic synthesis. Section 4 provides an extended analysis of methodological families. Section 5 surveys benchmark datasets, evaluation metrics, and reproducibility challenges. Section 6 examines five application domains through representative case studies. Section 7 addresses ethical, regulatory, and socio-technical challenges, including a detailed engagement with the EU AI Act. Section 8 sets out the research agenda. Section 9 concludes.

Figure 1: Milestones in AI & Law Research (1986–2027): Symbolic/Argumentation → Legal NLP/ML → Governance/Regulation

[Figure 1: Timeline to be rendered as a diagram in the final submission. Key milestones include: British Nationality Act as logic program (1986); ICAIL founding (1987); JURIX founding (1988); OWL/DL normative modelling (2014); argumentation schema tooling and ADF methodology for precedents (2016); Eunomos XML+ontology system (2016); CLAUDETTE ML system for unfair ToS clauses (2019); XAI via rationales in contract classification (2021); critique of court decision prediction task definitions (2022); transformer survey for LegalAI and ECHR argument mining corpus (2023); zero-shot legal QA on EU legislation and EU AI Act entry into force (2024); EU AI Act prohibitions and GPAI obligations (2025); EU AI Act full applicability (2026); transition deadline for certain high-risk product systems (2027).]

2. Review Protocol

2.1 Review Type and Reporting Standard

Given the breadth of the topic (spanning symbolic AI, machine learning, natural language processing, and AI governance), this review is designed as a systematic scoping review following the PRISMA Extension for Scoping Reviews (PRISMA-ScR) checklist (Tricco et al., 2018).
Scoping reviews are appropriate when the aim is to map a broad evidence base, identify key concepts and research gaps, and synthesise diverse methodologies rather than to assess the effectiveness of a specific intervention. Where possible, we have incorporated elements of systematic review rigour: explicit search documentation, reproducible inclusion/exclusion criteria, and structured quality appraisal. The SALSA framework (Search, AppraisaL, Synthesis, Analysis) serves as our process model, enabling us to present a traceable sequence from database queries to final thematic synthesis, and to maintain a clear distinction between descriptive mapping and interpretive analysis.

2.2 Databases and Sources

We employed a multi-source search strategy reflecting the interdisciplinary character of AI & Law research. The primary database for journal articles was SpringerLink, with specific attention to the full archive of the journal Artificial Intelligence and Law. Scopus and Web of Science were queried for supplementary citation mapping and to identify highly cited works outside the primary journal. For legal NLP and benchmark-oriented work, the ACL Anthology provided access to computational linguistics proceedings; NeurIPS proceedings (Datasets and Benchmarks track) yielded key benchmark papers; and arXiv was consulted for recent preprints where peer-reviewed versions were unavailable. ICAIL (International Conference on AI and Law) and JURIX (International Conference on Legal Knowledge and Information Systems) proceedings were treated as primary sources for argumentation, case-based reasoning, normative logic, and legal information retrieval. COLIEE (Competition on Legal Information Extraction/Entailment) results and overview papers serve as a state-of-practice reference for legal IR and entailment tasks.
Dutch policy and regulatory sources (including publications from the Autoriteit Persoonsgegevens, the Raad voor de rechtspraak, and the Wetenschappelijke Raad voor het Regeringsbeleid) contextualise national implementation of EU frameworks.

2.3 Search Terms and Temporal Scope

The temporal scope spans January 2011 to February 2026, with limited reference to pre-2011 seminal works (e.g., Sergot et al., 1986; the founding of ICAIL in 1987) for contextualisation only. An example search string, to be reproduced in the submission appendix, is as follows:

("Artificial Intelligence and Law" OR "AI and law" OR "computational legal reasoning" OR "legal argumentation" OR "legal ontology" OR "normative reasoning" OR "compliance checking") AND ("transformer" OR "BERT" OR "large language model" OR "legal NLP" OR "information retrieval" OR "textual entailment" OR "judgment prediction" OR "explainable AI")

This core string was adapted for each database according to its syntax conventions. Additional domain-specific terms ('COLIEE', 'CUAD', 'LexGLUE', 'LegalBench', 'FOIA', 'compliance checking', 'case-based reasoning', 'defeasible logic') were appended iteratively as new relevant clusters emerged during screening.

2.4 Inclusion and Exclusion Criteria

Inclusion criteria: (1) peer-reviewed articles or authoritative proceedings/benchmark papers published within 2011–2026; (2) direct relevance to AI techniques applied to legal tasks, or to formal legal reasoning with computational models; (3) preferably, explicit problem definition, reproducible method or data, and formal evaluation.

Exclusion criteria: (1) pure legal scholarship without a computational model or empirical evaluation; (2) tools or systems lacking sufficient methodological transparency, unless treated as 'industry practice' examples in a dedicated subsection; (3) works whose primary domain is adjacent (e.g., e-health decision support, general IR) without a substantive legal dimension.
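The core search string in Section 2.3 has a simple shape: two groups of quoted terms, each joined with OR, and the two groups joined with AND. The following is a minimal sketch of how such a string could be assembled programmatically so that it can be regenerated when terms are added. The function and variable names are illustrative assumptions, and the sketch does not reproduce any database-specific syntax adaptation.

```python
from typing import Iterable, Sequence


def or_group(terms: Iterable[str]) -> str:
    """Quote each term and join the terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(groups: Sequence[Iterable[str]]) -> str:
    """Join several OR-groups with AND, mirroring the shape of the core string."""
    return " AND ".join(or_group(group) for group in groups)


# Term lists copied from the example search string in Section 2.3.
TOPIC_TERMS = [
    "Artificial Intelligence and Law",
    "AI and law",
    "computational legal reasoning",
    "legal argumentation",
    "legal ontology",
    "normative reasoning",
    "compliance checking",
]
METHOD_TERMS = [
    "transformer",
    "BERT",
    "large language model",
    "legal NLP",
    "information retrieval",
    "textual entailment",
    "judgment prediction",
    "explainable AI",
]

CORE_QUERY = build_query([TOPIC_TERMS, METHOD_TERMS])
print(CORE_QUERY)
```

Keeping the term lists as data makes each iteration of the search auditable: a term appended during screening becomes a one-line change that can be logged alongside the date it was added.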
2.5 Screening and Appraisal

Titles and abstracts were screened against inclusion criteria in a first pass. Full-text review was applied to all potentially relevant records. Quality appraisal focused on problem definition clarity, data provenance, evaluation design, and relevance to the triadic taxonomy developed in Section 3. Given the scoping review design, low methodological quality was not grounds for exclusion but was noted in the synthesis as a limitation of specific clusters.

[Figure 2: PRISMA-ScR flow diagram (records identified, screened, assessed for eligibility, and included) to be inserted here in the final submission.]

3. Taxonomy and Thematic Synthesis

3.1 A Triadic Taxonomy of AI & Law Research

A working taxonomy that is practically useful for readers of Artificial Intelligence and Law must capture both the historical profile of the field (built on formal models, argumentation, and case-based reasoning) and the present moment, in which transformer-based language models have become the default experimental setting and regulatory governance has emerged as a research object in its own right. We propose a triadic taxonomy centred on three thematic families:

Text-Centric (Legal NLP): Research in this cluster treats legal language as its primary object, developing and evaluating models for classification, information retrieval, semantic entailment, question answering, summarisation, named entity recognition, and argument mining over legal corpora. The defining characteristic is that success is measured primarily through performance on textual tasks, often against benchmark datasets with standardised metrics.

Reasoning-Centric (Legal Reasoning): Research in this cluster is concerned with the structure of legal inference: how normative rules interact, how exceptions and conflicts are resolved, how interpretive arguments are constructed, how precedents are analogised, and how probabilities and narratives combine in evidential reasoning.
Methods range from defeasible logic and argumentation frameworks to Bayesian networks, case-based reasoning systems, and formal ontologies. The defining characteristic is explicit attention to inferential validity, not just predictive performance.

Governance-Centric (AI in/over Law): Research in this cluster examines the conditions under which AI systems can be deployed justifiably in legal contexts, and the regulatory frameworks that govern such deployment. Topics include algorithmic fairness and bias, transparency and explainability, privacy and data governance, human oversight, and compliance with specific regulatory instruments such as the EU AI Act and the GDPR.

These three families are not mutually exclusive: the most significant recent contributions tend to operate across boundaries, combining NLP with formal reasoning for interpretability, or embedding governance analysis within empirical studies of system performance. The triadic structure nonetheless offers a useful heuristic for mapping the literature and identifying under-served areas.

3.2 Method Families and Their Distribution

Across all three thematic families, the literature deploys methods from three broad methodological lineages: symbolic AI (logic, rules, argumentation, ontologies), statistical/ML AI (classical machine learning, deep learning, transformer language models), and hybrid AI (combinations that typically pair a statistical front-end for language understanding with a formal back-end for inference, constraint satisfaction, or audit logging). The distribution of these method families has shifted markedly across the period under review, with statistical and hybrid methods gaining ground while symbolic methods retain a strong presence in reasoning-centric and governance-centric research.
[Figure 3: 2D landscape map, Task family (text-centric ↔ reasoning-centric) × Model family (symbolic ↔ statistical), with governance risk as a colour dimension; to be rendered as a scatter/bubble plot in the final submission.]

4. Methodological Approaches

4.1 Symbolic and Formal Methods

The symbolic tradition in AI & Law rests on the insight that legal norms have a structure that can be captured by formal representations: they prescribe, prohibit, or permit actions; they apply conditionally; they can conflict; and they can be overridden by more specific or more recent norms. Deontic logic, defeasible reasoning, and argumentation frameworks are the primary tools for modelling this structure.

Normativity and Defeasibility: Legal systems are not monotonic: new evidence, exceptions, and higher-level norms can defeat prima facie conclusions. Defeasible logics, in various forms, have been developed precisely to handle this feature. Compliance checking work by Robaldo et al. (2023) demonstrates that propositional-level formalisms are often insufficient for realistic legal applications: compliance conditions apply to entire populations of individuals, making first-order representations necessary. Their comparative evaluation of multiple freely available reasoners (including answer set programming (ASP) systems and OWL-based reasoners) against a shared use case reveals a fundamental tension: reasoners with strong explainability properties tend to be computationally less efficient, while highly scalable systems such as ASP often lack built-in explanation mechanisms. This explainability-efficiency trade-off is one of the structural challenges the field has yet to fully resolve.

Argumentation Frameworks: Argumentation has long been recognised as particularly well-suited to legal reasoning because legal decisions characteristically involve weighing competing considerations rather than deriving conclusions from unambiguous premises.
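To make the idea of weighing competing arguments concrete, the sketch below computes the grounded extension of a Dung-style abstract argumentation framework, in which arguments are atomic and related only by an attack relation. This is a deliberately simpler formalism than the argumentation schemes and Abstract Dialectical Frameworks used in the works cited in this section, and the argument names are hypothetical. It illustrates defeasibility: a prima facie rule is defeated by an exception, and reinstated when the exception is itself rebutted.

```python
from typing import Dict, Iterable, Set, Tuple


def grounded_extension(arguments: Iterable[str],
                       attacks: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the grounded extension: the least fixed point of the
    characteristic function, computed by iterating from the empty set."""
    args = set(arguments)
    attack_list = list(attacks)

    # Map every argument to the set of arguments that attack it.
    attackers: Dict[str, Set[str]] = {a: set() for a in args}
    for source, target in attack_list:
        attackers[target].add(source)

    accepted: Set[str] = set()
    while True:
        # Arguments attacked by at least one currently accepted argument.
        defeated = {t for s, t in attack_list if s in accepted}
        # An argument is acceptable if all of its attackers are defeated.
        acceptable = {a for a in args if attackers[a] <= defeated}
        if acceptable == accepted:
            return accepted
        accepted = acceptable


# Hypothetical example: an exception attacks a rule, a rebuttal attacks the exception.
ARGUMENTS = ["rule_applies", "exception_applies", "exception_rebutted"]
ATTACKS = [
    ("exception_applies", "rule_applies"),
    ("exception_rebutted", "exception_applies"),
]

print(sorted(grounded_extension(ARGUMENTS, ATTACKS)))
```

Grounded semantics is the most sceptical of the standard choices: when two arguments attack each other and nothing else breaks the tie, neither is accepted.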
Walton, Sartor, and Macagno (2016) model statutory interpretation as a multi-scheme argumentation process that balances pro and contra arguments, providing both a logical formalisation of pro-tanto versus all-things-considered conclusions and tooling for visualisation and evaluation. Al-Abdulkarim, Atkinson, and Bench-Capon (2016) develop a design methodology for case-based reasoning systems using Abstract Dialectical Frameworks (ADFs), drawing an illuminating analogy with entity-relationship modelling to emphasise the engineering discipline required in legal knowledge representation.

Ontologies and Knowledge Representation: The Eunomos system (Boella et al., 2016) combines XML-structured legal documents with ontological modelling to provide a web-based knowledge management platform for legislative information. Francesconi (2014) develops OWL Description Logic patterns for normative relations and integrates them with retrieval architectures, demonstrating that formal semantic representations can simultaneously support human browsing, formal querying, and inferential reasoning over legislation.

Probabilistic and Narrative Methods: Evidence in legal proceedings is characteristically uncertain and fragmentary. Vlek et al. (2016) develop a methodology for explaining Bayesian network models of legal evidence using structured scenarios, addressing a critical weakness of probabilistic models: their opacity to lay decision-makers. By providing scenario-based narratives as interfaces to underlying probability calculations, the approach bridges formal inference and human comprehension.

4.2 Statistical and Machine Learning Methods

The statistical tradition in AI & Law gained momentum with the application of classical machine learning (support vector machines, logistic regression, random forests) to legal text classification tasks.
The defining feature of this paradigm is that performance is measured empirically against labelled data rather than verified against formal specifications.

Classical Machine Learning: The CLAUDETTE system (Lippi et al., 2019) exemplifies the classical ML approach applied to a practically significant legal problem: the automated detection of potentially unfair clauses in online Terms of Service agreements. Using supervised learning over a hand-annotated corpus, the system achieves competitive performance on clause classification and establishes a foundational benchmark for subsequent work. The choice of SVM-based models reflects the limited training data available, a persistent constraint in legal NLP where expert annotation is expensive.

Transformer-Based Models: The arrival of BERT and its successors transformed legal NLP research with remarkable speed. Greco and Tagarelli (2024) provide a systematic survey of transformer-based language models applied to legal AI tasks, organising an extensive literature by task type and model architecture. Their survey is methodologically notable for its explicit attention to the relationship between model choice and task characteristics: not all legal NLP tasks benefit equally from pre-training on general corpora, and domain-specific pre-training on legal text (e.g., LegalBERT, CamemBERT trained on French legal corpora) yields consistent improvements on tasks requiring precise legal terminology or citation structure.

Large Language Models and Zero-Shot Generalisation: The most recent cohort of large language models (LLMs), including GPT-4 class models and instruction-tuned variants of the LLaMA and Mistral families, has introduced a new mode of system development: zero-shot or few-shot prompting without task-specific fine-tuning. Sovrano et al.
(2025) investigate zero-shot legal question answering on European legislation using discourse-based retrieval selection (DiscoLQA), demonstrating that careful retrieval design (exploiting the discourse structure of EU legislative texts) can substantially reduce hallucination and improve answer precision even without fine-tuning. Their work illustrates a general principle: for LLM-based legal applications, the quality of the retrieval and grounding pipeline often matters more than raw model scale.

Court Decision Prediction: A significant sub-literature has developed around the prediction of judicial outcomes. Medvedeva, Wieling, and Vols (2023) provide an important methodological intervention in this area, arguing that 'court decision prediction' conflates at least three distinct tasks that make very different demands on models and that have very different implications for legal practice: outcome identification (labelling an existing decision's outcome), judgement categorisation (classifying decisions by legal issue or ground), and outcome forecasting (predicting the outcome of a pending case). Their taxonomy has direct implications for benchmark design: datasets and evaluation metrics appropriate for one task may be systematically misleading for another. Benedetto et al. (2025) extend decision-support in this domain by combining judgment prediction with sentence-level explanation, using legal named entity recognition masking and entity-aware transformer architectures to generate predictions that are traceable to specific evidentiary or normative elements of the case.

4.3 Hybrid and Explainability-Oriented Methods

Hybrid approaches that combine statistical and symbolic components have attracted growing interest as the limitations of purely neural systems for legal applications have become apparent.
The central motivation is that legal decisions must be justifiable (explainable in terms of legal norms, precedents, and facts), not merely accurate by some external criterion.

Rationale-Based Explanation: Ruggeri et al. (2022) develop a memory-augmented neural network architecture for detecting and explaining unfair clauses in consumer contracts. The system produces not just a classification but a legal rationale: a set of sentences or phrases that provide a human-readable justification for the prediction. Their evaluation demonstrates that rationale-based models can improve both predictive accuracy and explanation quality, and that users find rationale-equipped systems significantly more useful than those providing only labels.

XAI and the Black Box Problem: Brożek et al. (2024) provide a philosophically rigorous analysis of the 'black box problem' as it manifests in legal AI applications. Rather than treating opacity as a single problem, they decompose it into four analytically distinct components: opacity (the model's internal workings are not accessible), strangeness (the model's decision-making process does not resemble recognisable human reasoning), unpredictability (the model's outputs cannot be reliably anticipated), and the justification gap (the model cannot provide a legally adequate rationale). This four-part analysis is valuable for researchers because it clarifies which component of the black box problem a given XAI technique actually addresses, and which it leaves open.

Hybrid Architectures for Compliance: The compliance checking literature exemplifies the hybrid approach at the architectural level. A promising direction combines LLM-based semantic extraction (identifying the relevant facts and norms in natural language documents) with formal constraint solvers that check compliance, detect conflicts, and generate audit logs.
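The division of labour in such a hybrid architecture can be sketched in a few lines. In the sketch below, `extract_facts` is a hypothetical stand-in for the LLM-based extraction step and uses only keyword matching; the rule set is a toy example rather than a statement of any actual legal requirement; and the "solver" is a plain Python predicate check, not one of the reasoners evaluated in the cited literature. Conflict detection is omitted. The point is only the shape of the pipeline: extracted facts go into explicit rules, and every rule evaluation is written to an audit log.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, bool]


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    is_satisfied: Callable[[Facts], bool]  # True means the facts comply


def extract_facts(text: str) -> Facts:
    """Hypothetical placeholder for LLM-based extraction (keyword matching only)."""
    lowered = text.lower()
    return {
        "processes_personal_data": "personal data" in lowered,
        "records_consent": "consent" in lowered,
    }


# Toy rule for illustration only; not an actual legal requirement.
RULES = [
    Rule(
        rule_id="R1",
        description="If personal data is processed, consent must be recorded.",
        is_satisfied=lambda f: (not f["processes_personal_data"]) or f["records_consent"],
    ),
]


def check_compliance(text: str, rules: List[Rule]) -> Tuple[bool, List[dict]]:
    """Evaluate every rule against the extracted facts and keep an audit trail."""
    facts = extract_facts(text)
    audit_log = [
        {
            "rule_id": rule.rule_id,
            "description": rule.description,
            "facts": dict(facts),  # snapshot of the facts the rule saw
            "satisfied": rule.is_satisfied(facts),
        }
        for rule in rules
    ]
    return all(entry["satisfied"] for entry in audit_log), audit_log


compliant, log = check_compliance("We process personal data for analytics.", RULES)
print(compliant, log[0]["rule_id"])
```

Because the rules are explicit objects, a failed check can be traced to a named rule and the exact facts it was evaluated on, which the statistical front-end alone would not provide.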
Robaldo et al.'s (2023) comparative evaluation suggests that this architecture is technically feasible but that current technology fragmentation (with multiple incompatible formalisms and tools) is a significant obstacle to production deployment.

Table 1. Comparison of Methodological Families in AI & Law Research

| Method Family | Representation | Legal Strengths | Weaknesses/Risks | Key AIL Example |
|---|---|---|---|---|
| Rule/logic-based (symbolic) | Deontic rules, defeasibility, constraints | Transparent inference; explicit exceptions; compliance/norm conflicts | Knowledge engineering cost; scalability gaps | Compliance reasoner comparison on first-order use case (Robaldo et al., 2023) |
| Argumentation-based (symbolic) | Argument schemes, pro/contra, acceptability | Fits legal justification; interpretive and evidential reasoning | Formalisation choices affect outcomes; domain-specific evaluation | Statutory interpretation as argumentation schemes (Walton et al., 2016) |
| Case-based reasoning / precedent | Factors, issues/values, case graphs | Matches precedent dynamics and analogical reasoning | Factor representation is labour-intensive; limited cross-domain transfer | ADF methodology for factor-based cases (Al-Abdulkarim et al., 2016) |
| Probabilistic + narrative | Bayesian networks + scenarios | Rigorous in evidential reasoning; scenario narratives aid comprehension | Modelling expertise required; risk of model outsourcing | Explaining Bayesian networks via scenario schemes (Vlek et al., 2016) |
| Classical ML (SVM, etc.) | Feature space + labels | Robust with limited data; useful baseline for clause detection | Limited semantic generalisation; shallow explanation | CLAUDETTE: ML detection of unfair clauses (Lippi et al., 2019) |
| Deep learning / transformers | Pre-training + fine-tuning / in-context | State-of-the-art on most legal NLP tasks; scalable | Hallucination; justification gap; domain shift; governance demands | Systematic transformer review for LegalAI (Greco & Tagarelli, 2024) |
| XAI / rationale-based | Rationales, sentence-level explanation, attribution | Increases acceptability and auditability; connects to legal motivation | Explanation may be cosmetic; requires human evaluation for validation | Memory networks for ToS explanation (Ruggeri et al., 2022) |
| KR / ontologies and semantic web | RDF/OWL, XML, Hohfeld relations | Structural interoperability; formal querying; normative relation modelling | Ontology engineering cost; mismatch with raw natural language; maintenance at law change | OWL-DL patterns for normative retrieval (Francesconi, 2014); Eunomos (Boella et al., 2016) |

5. Data, Benchmarks, and Evaluation

5.1 Why Legal Data is Different

AI systems for legal applications work almost exclusively with language (statutes, judgments, contracts, administrative decisions) and with institutional context: procedural rules, jurisdictional constraints, norm hierarchies. These properties make the construction and use of legal datasets substantially more complex than in most general-purpose NLP applications.

First, labels in legal datasets are characteristically interpretive. Whether a clause in a contract is 'unfair,' whether a judicial outcome is 'predictable,' or whether a piece of text 'entails' a legal conclusion are questions that often admit of principled disagreement among experts, and that disagreement is not noise but signal about the inherent interpretive structure of law.
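Expert disagreement of this kind can be reported rather than hidden. One common statistic for two annotators is Cohen's kappa, which corrects observed agreement for the agreement expected by chance; the review does not state which agreement statistic the cited corpora use, so the choice here is an assumption, and the labels below are hypothetical. With observed agreement $p_o$ and chance agreement $p_e$, kappa is $(p_o - p_e) / (1 - p_e)$.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: share of items on which the annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(freq_a) | set(freq_b))

    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations of four contract clauses.
ANNOTATOR_A = ["unfair", "fair", "unfair", "fair"]
ANNOTATOR_B = ["unfair", "fair", "fair", "fair"]

print(cohens_kappa(ANNOTATOR_A, ANNOTATOR_B))
```

In this toy case the annotators agree on three of four clauses, yet half of that agreement would be expected by chance given their label frequencies, which is why raw agreement alone can overstate label reliability.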
Several contributions in the AI & Law literature explicitly model annotator disagreement as a substantive finding rather than a problem to be minimised (Habernal et al., 2024).

Second, legal datasets are typically jurisdiction-specific and language-specific in ways that reduce their generalisability. A dataset of European Court of Human Rights decisions is not straightforwardly applicable to common law contexts; a dataset of Dutch case law raises different challenges from one based on US federal contracts. This jurisdictional fragmentation is one of the most significant structural barriers to cumulative progress in the field.

Third, evaluation in legal AI frequently requires domain experts whose time is scarce and expensive. Standard automated metrics (accuracy, F1, exact match) may poorly capture legally relevant distinctions. A system that correctly predicts the outcome of a judgment but cannot produce a legally adequate rationale is of limited practical value; conversely, a system with lower predictive accuracy but richer explanatory output may be more useful to practitioners.

5.2 Key Benchmarks and Datasets

Table 2 summarises the principal benchmark datasets and evaluation resources that have shaped empirical research in the period under review.

Table 2. Key Benchmarks and Datasets in AI & Law Research (2011–2026)

| Dataset / Benchmark | Task Type | Domain / Jurisdiction | Lang. | Scale (indicative) | Typical Metrics | Primary Source |
|---|---|---|---|---|---|---|
| LexGLUE | NLU suite (multi-task) | Diverse legal NLU tasks | EN | Bundle of datasets; standardised evaluation | Acc / F1 (task-dependent) | Chalkidis et al. (2022) |
| LegalBench | Legal reasoning for LLMs | Broad (162 tasks; 6 reasoning types) | EN | 162 tasks; multi-LLM evaluation | Task-specific; Acc / F1 | Guha et al. (2023) |
| CUAD | Contract clause review (span/label) | Commercial contracts (EDGAR) | EN | 13k+ annotations; 510 contracts | F1 / EM span scores; per-label Acc | Hendrycks et al. (2021) |
| COLIEE | Legal IR + entailment/QA competition | Case law + statute law (JP/CA) | EN | Annual; multiple tasks | Micro-F1 (retrieval); Acc per task | COLIEE overview (10th edition) |
| ECHR Argument Mining Corpus | Argument mining | European Court of Human Rights | EN | 373 decisions; 2.3M tokens; 15k argument spans | Span-F1 + expert review | Habernal et al. (2024) |
| NL Case-Law Citation Prediction | Classification / prediction | Dutch case law | NL | Judgment-level dataset | Acc / F1 + error analysis | Schepers et al. (2024) |
| CLAUDETTE / ToS Unfair Clauses | Clause classification | Consumer law / Terms of Service | EN + multilingual | Multiple clause categories | Acc / F1; explanation evaluation | Lippi et al. (2019); Ruggeri et al. (2022); Galassi et al. (2025) |
| FOIA Deliberative Language | Sensitive text detection | Open-records / FOIA (US federal) | EN | Newly annotated training set; operational tests | Precision / Recall + usability | Branting et al. (2025) |

5.3 Evaluation Metrics and Their Limitations

Standard NLP metrics (accuracy, macro/micro F1, exact match, mean average precision) are necessary but not sufficient for evaluating legal AI systems. Several systematic limitations recur across the literature.

Asymmetric error costs: In many legal applications, false negatives and false positives carry very different costs. A false negative in sensitive document redaction (a disclosed deliberative communication) may have serious legal consequences; a false positive (an unnecessarily redacted document) may frustrate transparency obligations. Standard F1 treats these symmetrically. Calibrated threshold selection and cost-sensitive evaluation are thus important complements to aggregate performance metrics.

Jurisdiction and domain shift: Models trained on US contracts may perform poorly on EU equivalents due to differences in legal terminology, contractual structure, and applicable law. Evaluation protocols that report only in-domain test performance can systematically overstate practical utility.
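One simple way to avoid reporting only in-domain performance is to hold out one jurisdiction at a time: the model is trained on every other jurisdiction and tested on the one it has never seen. The sketch below generates such splits from a list of records. The record layout, the `jurisdiction` field name, and the sample data are assumptions made for illustration; no model training is included.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


def leave_one_jurisdiction_out(
    records: List[Record], key: str = "jurisdiction"
) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Yield (held_out, train, test) triples, one per distinct jurisdiction."""
    by_group: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        by_group[record[key]].append(record)

    for held_out in sorted(by_group):
        # Training data: every record from every other jurisdiction.
        train = [r for group, rows in by_group.items() if group != held_out for r in rows]
        # Test data: only records from the held-out jurisdiction.
        yield held_out, train, by_group[held_out]


# Hypothetical records; a real dataset would carry text and labels.
RECORDS = [
    {"id": "c1", "jurisdiction": "US"},
    {"id": "c2", "jurisdiction": "US"},
    {"id": "c3", "jurisdiction": "EU"},
    {"id": "c4", "jurisdiction": "NL"},
]

for held_out, train, test in leave_one_jurisdiction_out(RECORDS):
    print(held_out, len(train), len(test))
```

Reporting the per-jurisdiction scores side by side with the in-domain score makes the size of the shift visible instead of averaging it away.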
Cross-jurisdiction and cross-domain evaluation splits, where feasible, provide more informative performance estimates.

Label disagreement: Multiple AI & Law benchmarks involve inherently contestable labels. The ECHR argument mining corpus (Habernal et al., 2024) reports inter-annotator agreement statistics alongside model performance and uses expert review as an additional evaluation layer, a methodological standard that should be widely adopted.

The justification gap: Perhaps the most fundamental evaluation challenge in legal AI is the absence of standardised protocols for assessing the adequacy of system-generated explanations and justifications. Brożek et al. (2024) frame this as the 'justification' component of the black box problem: even a system with high predictive accuracy and locally faithful post-hoc explanations may fail to provide a justification that would be acceptable in a legal proceeding. Developing 'justification benchmarks' (evaluation tasks that require traceable arguments, normative citations, and structural reasoning) is one of the most important open methodological challenges in the field.

6. Application Domains and Case Studies

6.1 Contract Analysis and Consumer Protection

The automated analysis of legal contracts, particularly for detecting unfair, unusual, or high-risk clauses, has emerged as one of the most productive application areas in AI & Law, combining clear practical value with tractable technical problem definitions.

The CLAUDETTE project (Lippi et al., 2019) established the foundational framework: a supervised machine learning system trained on annotated Terms of Service documents to detect clauses belonging to potentially unfair categories defined by the EU's Unfair Contract Terms Directive. The system demonstrated that ML-based clause detection was technically feasible and practically useful, and it established a benchmark dataset that has supported cumulative research.

The subsequent work by Ruggeri et al. (2022) addressed a critical limitation of the original system: the absence of explanations for its predictions. Using memory-augmented neural networks that learn to attend to relevant 'legal rationale' passages, the extended system produces both a classification and a set of supporting text spans that justify it. Human evaluation confirmed that rationale-equipped systems are substantially more useful to non-expert users than those providing only labels, validating the investment in explanation infrastructure.

Galassi et al. (2025) extend the CLAUDETTE framework to a multilingual EU context, comparing approaches for detecting unfair clauses in Terms of Service documents across multiple European languages. Their findings confirm that multilingual robustness is non-trivial (models trained primarily on English-language data show substantial performance degradation on less-resourced EU languages) and that cross-lingual transfer requires careful attention to both model architecture and training data distribution.

From a socio-technical perspective, contract AI is not merely a technical problem. Consumer contracts are sites of power asymmetry: the companies that draft Terms of Service have legal teams; the consumers who agree to them typically do not. AI-based detection systems can support regulators and civil society organisations in identifying systematic unfairness at scale. However, deployment requires careful institutional design: who has access to the system, how false positive rates interact with regulatory burden, and how detection results feed into enforcement processes.

6.2 E-Discovery, Open Records, and Sensitive Text Management

Electronic discovery and open-records management represent a second major application cluster, characterised by the need to classify large volumes of documents for legal sensitivity or privileged status under time pressure and with significant legal consequences for errors.

Branting et al. (2025) describe a decision-support system for US Freedom of Information Act (FOIA) requests, designed to assist federal agency reviewers in identifying 'deliberative process' text, that is, internal communications about policy development that are subject to a FOIA exemption. The system was developed using a newly annotated training dataset, evaluated against domain expert annotations, and tested in an operational deployment context with federal agency staff.

Methodologically, the FOIA paper is exemplary for its integration of technical evaluation (precision/recall on the annotation task) with human-centred evaluation (usability studies examining time-to-decision, error rates, and reviewer confidence). It demonstrates that user interface design and workflow integration are not merely engineering considerations but substantive research contributions: a technically capable system that is not usable by its intended operators fails as a legal tool, regardless of its benchmark performance. The 'human-in-the-loop' evaluation framework used in this work should be regarded as a methodological standard for high-stakes legal AI applications.

6.3 Compliance Checking and Normative Reasoning

Compliance checking (determining whether a system of actions, contracts, or business processes satisfies a set of legal norms) is perhaps the application domain that most directly showcases the strengths and limitations of formal AI methods.

Robaldo et al. (2023) conduct a comparative evaluation of multiple freely available reasoning technologies on a compliance checking use case involving first-order knowledge and compensatory norms. Their evaluation is methodologically important because it treats the choice of reasoning technology as an open research question rather than a presupposition: different formalisms, including OWL-based reasoners, ASP solvers, and Datalog variants, make different trade-offs between expressiveness, computational efficiency, and explanation quality.
The paper's central finding (that explainability and efficiency are in tension, with ASP systems offering better scalability but limited explanation facilities while OWL-based systems provide richer explanations at the cost of performance) articulates a structural constraint that hybrid architectures must navigate. An important implication is that compliance checking systems cannot be evaluated on technical correctness alone: the quality and accessibility of the explanations they generate are equally important criteria for legal deployability.

6.4 Legal Interpretation, Argumentation, and Precedent

Legal interpretation (determining what a statutory or regulatory text means in a specific factual context) is perhaps the most intellectually demanding task in legal practice, and arguably the one least amenable to purely statistical approaches.

Walton, Sartor, and Macagno (2016) model statutory interpretation as a multi-scheme argumentation process in which competing interpretive arguments (grammatical, systematic, teleological, and analogical) are weighed against each other using argument schemes from rhetoric and informal logic. Their formalisation provides both a theoretical framework for understanding interpretation debates and a computational architecture that supports tooling for visualisation and evaluation. The integration of pro-tanto and all-things-considered reasoning provides a nuanced treatment of how interpretive conclusions can be qualified by context.

For precedent reasoning, Al-Abdulkarim, Atkinson, and Bench-Capon (2016) develop a systematic methodology for designing case-based reasoning systems using Abstract Dialectical Frameworks. Their approach treats the design of a legal reasoning system as an engineering discipline with explicit design choices: which factors are modelled, how issues and values are related, and how the resulting system handles novel cases that do not map cleanly onto existing precedents. The analogy with entity-relationship modelling in database design is apt: legal knowledge representation requires the same combination of domain expertise, representational discipline, and iterative refinement that complex data modelling demands.

Habernal et al. (2024) approach legal argumentation from the NLP direction, constructing a large annotated corpus of argument spans from European Court of Human Rights decisions. Their annotation schema is explicitly grounded in legal theory (distinguishing conclusion, premise, and epistemic support in legally meaningful ways) and their evaluation combines automated metrics with expert review. The paper establishes both a benchmark and a methodology that other argument mining researchers can adopt.

6.5 Adjudication Support: Prediction, Explanation, and Citation

Research on AI support for judicial decision-making has grown substantially in the review period, fuelled both by increasing availability of digitised case law and by commercial interest in litigation prediction tools.

As noted above, Medvedeva, Wieling, and Vols (2023) provide an essential methodological corrective by distinguishing three distinct tasks that are routinely conflated under the 'court decision prediction' label. Their taxonomy (outcome identification, judgement categorisation, and outcome forecasting) has significant implications for how benchmarks should be designed and how results should be interpreted. Outcome identification (labelling a known decision) and outcome forecasting (predicting an unknown pending decision) are not merely different in difficulty; they make fundamentally different demands on models and have entirely different practical implications.

Benedetto et al. (2025) address the explanation gap in prediction-oriented research by combining judgment prediction with sentence-level explanation. Their entity-aware transformer architecture, which leverages legal named entity recognition to mask and attend to legally significant entities, produces predictions that are traceable to specific passages in the judgment. The approach also addresses privacy concerns by allowing selective anonymisation of legally sensitive entities.

Schepers et al. (2024) investigate a distinct but practically important question in the Dutch legal context: given the increasing volume of published case law, can NLP models predict whether a judgment will subsequently be cited by other courts? Citation prediction is proposed as a proxy for 'legal authority': a judgment that is frequently cited by other courts has, in some sense, made law. The paper combines careful NLP methodology with legal domain analysis, situating citation patterns within the institutional structure of the Dutch court system.

7. Ethical, Regulatory, and Socio-Technical Challenges

7.1 The EU AI Act and Its Implications for AI & Law Research

The EU Artificial Intelligence Act (Regulation (EU) 2024/1689), which entered into force on 1 August 2024, represents the most significant regulatory development for AI & Law research since the field's founding. Its risk-based framework classifies AI systems into prohibited, high-risk, limited-risk, and minimal-risk categories, with 'administration of justice and democratic processes' explicitly listed as a high-risk domain. For AI systems deployed in this domain, the Act imposes obligations regarding data quality and governance, technical documentation, transparency and traceability, accuracy and robustness, and human oversight.

These obligations are not merely compliance requirements for commercial vendors: they define a research agenda. Systems that cannot be documented, audited, or subjected to meaningful human oversight are, under the Act, legally non-deployable in high-risk legal contexts regardless of their technical performance.
This creates a powerful institutional incentive for the kind of explainability, auditability, and formal verification research that the symbolic and hybrid traditions in AI & Law have long advocated.

The phased implementation timeline is important for researchers to understand. The prohibitions on clearly unacceptable AI practices and AI literacy requirements entered into application in February 2025. Obligations for general-purpose AI (GPAI) models (including foundation models used as components of legal AI systems) became applicable in August 2025. The core obligations for high-risk AI systems will apply from August 2026, with a transitional period extending to August 2027 for certain high-risk product systems already on the market.

The Act's transparency requirements are particularly relevant for the transformer-based systems that have become dominant in legal NLP. Systems that make automated decisions affecting individuals must be able to explain those decisions in terms that the affected persons can understand. Post-hoc explanations of the kind generated by SHAP or LIME-style methods may not satisfy this requirement if they cannot be connected to legally meaningful reasoning steps.

7.2 Fairness, Discrimination, and Sensitive Personal Data

The use of personal data in AI-based legal decision-support raises both GDPR compliance issues and substantive fairness concerns. Žliobaitė and Custers (2016) surface a non-obvious tension: in some circumstances, using sensitive personal data (such as race or gender) in a decision model may be necessary to detect and correct for existing discrimination in training data. Excluding these variables without understanding the causal structure of the data generating process can preserve or amplify historical biases under a superficial appearance of fairness.
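The tension around sensitive attributes can be illustrated with a toy calculation. The sketch below is an assumption-laden Python example, not the method of Žliobaitė and Custers: the applicants, the postcode proxy, the decision rule, and the function names `selection_rates` and `parity_gap` are all invented for illustration, and the gap in selection rates is only one of several possible fairness measures.

```python
def selection_rates(decisions, groups):
    """Share of positive decisions per group."""
    totals, positives = {}, {}
    for decision, group in zip(decisions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + decision
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(decisions, groups):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Invented applicants: (sensitive group, postcode). Postcode acts as a proxy.
applicants = [
    ("x", "A"), ("x", "A"), ("x", "A"), ("x", "B"),
    ("y", "A"), ("y", "B"), ("y", "B"), ("y", "B"),
]

# A "blind" rule that never looks at the sensitive attribute.
decisions = [1 if postcode == "A" else 0 for _, postcode in applicants]

# Measuring the disparity is only possible because the attribute was recorded.
groups = [group for group, _ in applicants]
rates = selection_rates(decisions, groups)
gap = parity_gap(decisions, groups)
print(rates, gap)
```

In this toy setting the rule ignores the sensitive attribute yet still treats the two groups very differently, and the difference can only be quantified because the attribute is available at evaluation time.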
The Dutch Data Protection Authority (Autoriteit Persoonsgegevens) has emphasised that all algorithmic processing of personal data must comply with GDPR requirements, including purpose limitation, data minimisation, transparency, and rights of data subjects. For legal AI researchers, this means that dataset construction, model training, and system deployment must be embedded in a privacy-by-design framework that goes beyond standard anonymisation procedures.

7.3 Human Oversight and the Automation Paradox

A recurring theme across AI & Law applications is the tension between efficiency gains from automation and the legal requirements for human accountability. Legal decisions that affect individuals must, in most jurisdictions, be made by or under the meaningful oversight of a responsible human decision-maker. AI systems that are too complex for their human operators to understand or monitor effectively may undermine this requirement even when they are formally embedded in human-in-the-loop workflows.

The FOIA assistant evaluated by Branting et al. (2025) illustrates this challenge: their usability studies reveal that reviewers often find it difficult to critically evaluate or override system recommendations for documents they have not independently reviewed. This 'automation paradox' (where increasing system capability reduces the effective oversight that human operators exercise) is a fundamental socio-technical challenge for high-stakes legal AI deployment, and one that purely technical research cannot resolve.

8. Research Agenda and Open Challenges

Our synthesis identifies five priority areas for AI & Law research in the coming period.

8.1 From Accuracy to Legal Validity: Justification-Oriented Benchmarks

Current legal AI benchmarks primarily reward correct predictions against labelled data. What they systematically fail to capture is whether a system's output is legally valid: whether it respects the hierarchy of norms, correctly applies exceptions, provides adequate justification, and exhibits procedural fairness. We propose the development of 'justification benchmarks' that require systems to produce not just predictions but traceable justifications: citations to relevant norms and precedents, explicit handling of exceptions and conflicts, and structured arguments that follow legally recognised inference patterns. Such benchmarks would require collaboration between computational researchers and legal domain experts and would constitute a major methodological contribution to the field.

8.2 Hybrid LLM-Plus-Formal-Constraint Architectures

The evidence reviewed suggests that neither purely neural nor purely symbolic approaches are sufficient for the full range of legal AI tasks. LLMs excel at natural language understanding and generation but lack reliable mechanisms for normative consistency or audit trail generation. Formal reasoners excel at constraint satisfaction and explanation generation but struggle with the ambiguity and variability of natural legal language. The most promising architectural direction combines these complementarities: LLMs as semantic front-ends for information extraction and hypothesis generation, coupled with formal constraint engines (ASP, SHACL, OWL-based reasoners, or normative constraint languages) for final inference, conflict detection, and audit logging. Realising this architecture requires advances in semantic parsing, uncertainty representation, and interoperability between NLP and formal reasoning components.

8.3 Multilingual Benchmarks and Cross-Jurisdiction Transfer

EU legal contexts require multilingual robustness across the 24 official languages, yet the benchmark infrastructure for non-English legal NLP remains sparse. Cross-jurisdiction transfer between common law and civil law systems raises additional challenges related to different legal concepts, institutional structures, and interpretive traditions. A concrete research agenda would develop multilingual benchmarks with explicit juridical-comparative label definitions (i.e., annotation guidelines that account for systematic differences in legal concepts across jurisdictions) and would incorporate annotator disagreement modelling as a first-class component of the evaluation framework.

8.4 Human-Centred Evaluation as a First-Class Method

The FOIA assistant work and the rationale-explanation literature both suggest that human-centred evaluation (measuring actual impact on decision quality, workload, accuracy, and appropriate trust calibration) is at least as important as automated metric performance for legal AI systems. We recommend that the field develop standardised protocols for human-in-the-loop evaluation of legal AI systems, including pre-registered study designs, standardised outcome measures (time-to-decision, error rate, correction behaviour, trust and confidence calibration), and guidelines for meaningful usability testing with domain experts.

8.5 Open-Texture Detection and Regulatory Interpretation

Open texture, the phenomenon by which apparently clear legal terms acquire indeterminate applications at the margins, is a fundamental feature of law that any system intended to support regulatory interpretation must handle. Several AI & Law contributions have proposed that detecting open-texture terms is a necessary prerequisite for reliable regulatory automation. LLMs may offer new capabilities in this area, given their ability to generate multiple candidate interpretations, but evaluation requires legally grounded annotation: which terms are 'open texture,' in what contexts, and with what degree of interpretive controversy. A dataset of open-texture terms annotated by legal experts, with accompanying explanation requirements, would be a valuable community resource.

9. Conclusion

This systematic scoping review has mapped fifteen years of research at the intersection of Artificial Intelligence and Law, from 2011 to 2026. The overarching narrative is one of productive but unresolved tension between two paradigms: the formal, symbol-based tradition that prizes explicability, normative correctness, and procedural legitimacy; and the statistical, data-driven tradition that prizes empirical performance, scalability, and practical usability.

The period under review has seen remarkable progress on both fronts. Symbolic methods have become more computationally tractable, more amenable to integration with ontological and knowledge graph infrastructure, and more sophisticated in their treatment of argumentation, defeasibility, and uncertainty. Statistical methods have been transformed by the pre-training paradigm, with transformer-based language models achieving human-competitive performance on many legal NLP benchmarks. Hybrid architectures are emerging that seek to combine the complementary strengths of both traditions.

At the same time, several structural challenges persist. The gap between predictive accuracy and legal justifiability remains wide and is not bridged by current XAI techniques. Benchmark infrastructure, though greatly improved, remains concentrated in English-language, common law contexts and does not adequately capture the juridical validity of system outputs. Human-centred evaluation is underused relative to its importance for high-stakes legal applications. And the regulatory context (particularly the EU AI Act's explicit requirements for high-risk AI systems in the justice domain) is reshaping the design space for legal AI in ways that amplify the need for exactly the formal verification, documentation, and human oversight capabilities that the field's founding tradition championed.

The research agenda we have proposed (justification benchmarks, hybrid architectures, multilingual transfer, human-centred protocols, and open-texture detection) is ambitious but not unrealistic. Crucially, it requires collaboration between computational researchers, legal scholars, and practitioners that goes beyond the surface-level interdisciplinarity of importing legal data into NLP pipelines. Addressing the deepest challenges of AI & Law requires bringing legal reasoning, in its full interpretive, argumentative, and institutional complexity, to the centre of the technical research agenda.

Declarations

Conflict of Interest: The author declares no conflicts of interest.

Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data Availability: No new datasets were generated or analysed for this review. All datasets referenced are described in the relevant primary sources cited in the reference list.

AI Tool Usage Disclosure: Consistent with Springer policy, large language model assistance was used for editing support; the intellectual content, analysis, synthesis, and all substantive claims are solely the work of the named author.
Author Contribution

- A systematic scoping review (2011–Feb 2026) of AI & Law research following a PRISMA-ScR–informed protocol
- A triadic taxonomy of the field (text-centric / reasoning-centric / governance-centric) and mapping of methods (symbolic / statistical / hybrid), benchmarks, and application domains
- A synthesis arguing the field has moved from “AI or Law” toward hybrid socio-technical systems balancing legal justification/traceability with empirical performance/reproducibility
- A research agenda with five open challenges (justification benchmarks; LLM+formal constraints; multilingual/cross-jurisdiction transfer; human-centred evaluation; open-texture detection)

References

Al-Abdulkarim L, Atkinson K, Bench-Capon T (2016) A methodology for designing systems to reason with legal cases using Abstract Dialectical Frameworks. Artif Intell Law 24:1–49
Benedetto I, Koudounas A, Vaiani L et al (2025) Boosting court judgment prediction and explanation using legal entities. Artif Intell Law 33:605–640
Boella G, Di Caro L, Humphreys L et al (2016) Eunomos, a legal document and knowledge management system for the Web to provide relevant, reliable and up-to-date information on the law. Artif Intell Law 24:245–283
Branting K, Brown B, Giannella C et al (2025) Decision support for detecting sensitive text in government records. Artif Intell Law 33:171–197
Brożek B, Furman M, Jakubiec M, Kucharzyk B (2024) The black box problem revisited. Real and imaginary challenges for automated legal decision making. Artif Intell Law 32:427–440
Chalkidis I, Jana A, Hartung D, Bommarito M, Androutsopoulos I, Katz DM, Aletras N (2022) LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. Proceedings of ACL 2022
European Commission (2024) AI Act enters into force, 1 August
European Union (2024) Regulation (EU) 2024/1689 of the European Parliament and of the Council (Artificial Intelligence Act). Official Journal of the European Union
Francesconi E (2014) A description logic framework for advanced accessing and reasoning over normative provisions. Artif Intell Law 22:291–311
Galassi A, Lagioia F, Jabłonowska A et al (2025) Unfair clause detection in terms of service across multiple languages. Artif Intell Law 33:641–689
Grant MJ, Booth A (2009) A typology of reviews: an analysis of 14 review types and associated methodologies. Health Inf Libr J 26(2):91–108
Greco CM, Tagarelli A (2024) Bringing order into the realm of Transformer-based language models for artificial intelligence and law. Artif Intell Law 32:863–1010
Guha N, Nyarko J, Ho DE et al (2023) LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Proceedings of NeurIPS 2023 (Datasets and Benchmarks Track)
Habernal I, Faber D, Recchia N et al (2024) Mining legal arguments in court decisions. Artif Intell Law 32:1–38
Hendrycks D, Burns C, Chen A, Ball S (2021) CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review. Proceedings of NeurIPS 2021 (Datasets and Benchmarks Track)
Lippi M, Pałka P, Contissa G et al (2019) CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artif Intell Law 27:117–139
Medvedeva M, Wieling M, Vols M (2023) Rethinking the field of automatic prediction of court decisions. Artif Intell Law 31:195–212
Page MJ, McKenzie JE, Bossuyt PM et al (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372:n71
Robaldo L, Batsakis S, Calegari R et al (2023) Compliance checking on first-order knowledge with conflicting and compensatory norms: a comparison among currently available technologies. Artif Intell Law
Ruggeri F, Lagioia F, Lippi M, Torroni P (2022) Detecting and explaining unfairness in consumer contracts through memory networks. Artif Intell Law 30:59–92
Schepers I, Medvedeva M, Bruijn M, Wieling M, Vols M (2024) Predicting citations in Dutch case law with natural language processing. Artif Intell Law 32:807–837
Sergot MJ, Sadri F, Kowalski RA, Kriwaczek F, Hammond P, Cory HT (1986) The British Nationality Act as a logic program. Commun ACM 29(5):370–386
Sovrano F, Palmirani M, Sapienza S, Pistone V (2025) DiscoLQA: zero-shot discourse-based legal question answering on European Legislation. Artif Intell Law 33:323–359
Tricco AC, Lillie E, Zarin W et al (2018) PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med 169(7):467–473
Vlek CS, Prakken H, Renooij S, Verheij B (2016) A method for explaining Bayesian networks for legal evidence with scenarios. Artif Intell Law
Walton D, Sartor G, Macagno F (2016) An argumentation framework for contested cases of statutory interpretation. Artif Intell Law 24:51–91
Žliobaitė I, Custers B (2016) Using sensitive personal data may be necessary for avoiding discrimination in data-driven decision models. Artif Intell Law 24:183–201
Autoriteit Persoonsgegevens (2025) Regels bij gebruik van AI & algoritmes (AVG-kader)
Autoriteit Persoonsgegevens (2026) Visie op generatieve AI: zonder duidelijke waarden dreigt 'Wilde Westen'
Raad voor de rechtspraak (2019–2025) Rechtstreeks (themanummer: Algoritmes in de rechtspraak); AI voor een rechtvaardige Rechtspraak (AI-strategie)
Wetenschappelijke Raad voor het Regeringsbeleid (2021) Artificiële Intelligentie – adviesprojecten en documenten

Additional Declarations

No competing interests reported.
First, it maps the thematic and methodological landscape of AI \u0026amp; Law research between 2011 and 2026, drawing primarily on publications in the Springer journal Artificial Intelligence and Law supplemented by major proceedings (ICAIL, JURIX) and benchmark-defining work from NLP venues. Second, it evaluates the methodological families\u0026mdash;symbolic, statistical, and hybrid\u0026mdash;against the specific demands that legal applications place on AI systems. Third, it analyses the benchmark and dataset infrastructure that has emerged, with attention to evaluation pitfalls that are specific to the legal domain. Fourth, it proposes a concrete research agenda oriented toward the regulatory, technical, and socio-legal challenges ahead.\u003c/p\u003e \u003cp\u003eThe remainder of this paper is structured as follows. Section \u003cspan refid=\"Sec2\" class=\"InternalRef\"\u003e2\u003c/span\u003e describes the review protocol, including search strategy, inclusion and exclusion criteria, and quality assessment. Section \u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003e3\u003c/span\u003e presents the triadic taxonomy that structures our thematic synthesis. Section \u003cspan refid=\"Sec11\" class=\"InternalRef\"\u003e4\u003c/span\u003e provides an extended analysis of methodological families. Section \u003cspan refid=\"Sec15\" class=\"InternalRef\"\u003e5\u003c/span\u003e surveys benchmark datasets, evaluation metrics, and reproducibility challenges. Section \u003cspan refid=\"Sec19\" class=\"InternalRef\"\u003e6\u003c/span\u003e examines five application domains through representative case studies. Section \u003cspan refid=\"Sec25\" class=\"InternalRef\"\u003e7\u003c/span\u003e addresses ethical, regulatory, and socio-technical challenges, including a detailed engagement with the EU AI Act. Section \u003cspan refid=\"Sec29\" class=\"InternalRef\"\u003e8\u003c/span\u003e sets out the research agenda. 
Section \u003cspan refid=\"Sec35\" class=\"InternalRef\"\u003e9\u003c/span\u003e concludes.\u003c/p\u003e \u003cp\u003eFigure 1: Milestones in AI \u0026amp; Law Research (1986\u0026ndash;2027): Symbolic/Argumentation \u0026rarr; Legal NLP/ML \u0026rarr; Governance/Regulation\u003c/p\u003e \u003cp\u003e[Figure 1 \u0026ndash; Timeline to be rendered as diagram in final submission: key milestones include British Nationality Act as logic program (1986); ICAIL founding (1987); JURIX founding (1988); OWL/DL normative modelling (2014); argumentation schema tooling and ADF methodology for precedents (2016); Eunomos XML+ontology system (2016); CLAUDETTE ML system for unfair ToS clauses (2019); XAI via rationales in contract classification (2021); critique of court decision prediction task definitions (2022); transformer survey for LegalAI and ECHR argument mining corpus (2023); zero-shot legal QA on EU legislation and EU AI Act entry into force (2024); EU AI Act prohibitions and GPAI obligations (2025); EU AI Act full applicability (2026); transition deadline for certain high-risk product systems (2027).]\u003c/p\u003e"},{"header":"2. Review Protocol","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Review Type and Reporting Standard\u003c/h2\u003e \u003cp\u003eGiven the breadth of the topic\u0026mdash;spanning symbolic AI, machine learning, natural language processing, and AI governance\u0026mdash;this review is designed as a systematic scoping review following the PRISMA Extension for Scoping Reviews (PRISMA-ScR) checklist (Tricco et al., \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). Scoping reviews are appropriate when the aim is to map a broad evidence base, identify key concepts and research gaps, and synthesise diverse methodologies rather than to assess the effectiveness of a specific intervention. 
Where possible, we have incorporated elements of systematic review rigour: explicit search documentation, reproducible inclusion/exclusion criteria, and structured quality appraisal.\u003c/p\u003e \u003cp\u003eThe SALSA framework (Search, AppraisaL, Synthesis, Analysis) serves as our process model, enabling us to present a traceable sequence from database queries to final thematic synthesis, and to maintain a clear distinction between descriptive mapping and interpretive analysis.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Databases and Sources\u003c/h2\u003e \u003cp\u003eWe employed a multi-source search strategy reflecting the interdisciplinary character of AI \u0026amp; Law research. The primary database for journal articles was SpringerLink, with specific attention to the full archive of the journal Artificial Intelligence and Law. Scopus and Web of Science were queried for supplementary citation mapping and to identify highly cited works outside the primary journal. For legal NLP and benchmark-oriented work, the ACL Anthology provided access to computational linguistics proceedings; NeurIPS proceedings (Datasets and Benchmarks track) yielded key benchmark papers; and arXiv was consulted for recent preprints where peer-reviewed versions were unavailable.\u003c/p\u003e \u003cp\u003eICAIL (International Conference on Artificial Intelligence and Law) and JURIX (International Conference on Legal Knowledge and Information Systems) proceedings were treated as primary sources for argumentation, case-based reasoning, normative logic, and legal information retrieval. COLIEE (Competition on Legal Information Extraction/Entailment) results and overview papers served as a state-of-practice reference for legal IR and entailment tasks. 
Dutch policy and regulatory sources\u0026mdash;including publications from the Autoriteit Persoonsgegevens, the Raad voor de rechtspraak, and the Wetenschappelijke Raad voor het Regeringsbeleid\u0026mdash;contextualise national implementation of EU frameworks.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Search Terms and Temporal Scope\u003c/h2\u003e \u003cp\u003eThe temporal scope spans January 2011 to February 2026, with limited reference to pre-2011 seminal works (e.g., Sergot et al., \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e1986\u003c/span\u003e; the founding of ICAIL in 1987) for contextualisation only. An example search string, to be reproduced in the submission appendix, is as follows:\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e \u003cspan fontcategory=\"NonProportional\" class=\"\" name=\"Emphasis\"\u003e(\"Artificial Intelligence and Law\" OR \"AI and law\" OR \"computational legal reasoning\" OR \"legal argumentation\" OR \"legal ontology\" OR \"normative reasoning\" OR \"compliance checking\") AND (\"transformer\" OR \"BERT\" OR \"large language model\" OR \"legal NLP\" OR \"information retrieval\" OR \"textual entailment\" OR \"judgment prediction\" OR \"explainable AI\")\u003c/span\u003e\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThis core string was adapted for each database according to its syntax conventions. 
Additional domain-specific terms\u0026mdash;'COLIEE', 'CUAD', 'LexGLUE', 'LegalBench', 'FOIA', 'compliance checking', 'case-based reasoning', 'defeasible logic'\u0026mdash;were appended iteratively as new relevant clusters emerged during screening.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Inclusion and Exclusion Criteria\u003c/h2\u003e \u003cp\u003eInclusion criteria: (1) peer-reviewed articles or authoritative proceedings/benchmark papers published within 2011\u0026ndash;2026; (2) direct relevance to AI techniques applied to legal tasks, or to formal legal reasoning with computational models; (3) preferably, explicit problem definition, reproducible method or data, and formal evaluation. Exclusion criteria: (1) pure legal scholarship without a computational model or empirical evaluation; (2) tools or systems lacking sufficient methodological transparency, unless treated as 'industry practice' examples in a dedicated subsection; (3) works whose primary domain is adjacent (e.g., e-health decision support, general IR) without a substantive legal dimension.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Screening and Appraisal\u003c/h2\u003e \u003cp\u003eTitles and abstracts were screened against inclusion criteria in a first pass. Full-text review was applied to all potentially relevant records. Quality appraisal focused on problem definition clarity, data provenance, evaluation design, and relevance to the triadic taxonomy developed in Section \u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003e3\u003c/span\u003e. 
Given the scoping review design, low methodological quality was not grounds for exclusion but was noted in the synthesis as a limitation of specific clusters.\u003c/p\u003e \u003cp\u003e[Figure 2: PRISMA-ScR flow diagram\u0026mdash;records identified, screened, assessed for eligibility, and included\u0026mdash;to be inserted here in the final submission.]\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Taxonomy and Thematic Synthesis","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e3.1 A Triadic Taxonomy of AI \u0026amp; Law Research\u003c/h2\u003e \u003cp\u003eA working taxonomy that is practically useful for readers of Artificial Intelligence and Law must capture both the historical profile of the field\u0026mdash;built on formal models, argumentation, and case-based reasoning\u0026mdash;and the present moment, in which transformer-based language models have become the default experimental setting and regulatory governance has emerged as a research object in its own right. We propose a triadic taxonomy centred on three thematic families:\u003c/p\u003e \u003cp\u003eText-Centric (Legal NLP): Research in this cluster treats legal language as its primary object, developing and evaluating models for classification, information retrieval, semantic entailment, question answering, summarisation, named entity recognition, and argument mining over legal corpora. The defining characteristic is that success is measured primarily through performance on textual tasks, often against benchmark datasets with standardised metrics.\u003c/p\u003e \u003cp\u003eReasoning-Centric (Legal Reasoning): Research in this cluster is concerned with the structure of legal inference: how normative rules interact, how exceptions and conflicts are resolved, how interpretive arguments are constructed, how precedents are analogised, and how probabilities and narratives combine in evidential reasoning. 
Methods range from defeasible logic and argumentation frameworks to Bayesian networks, case-based reasoning systems, and formal ontologies. The defining characteristic is explicit attention to inferential validity, not just predictive performance.\u003c/p\u003e \u003cp\u003eGovernance-Centric (AI in/over Law): Research in this cluster examines the conditions under which AI systems can be deployed justifiably in legal contexts, and the regulatory frameworks that govern such deployment. Topics include algorithmic fairness and bias, transparency and explainability, privacy and data governance, human oversight, and compliance with specific regulatory instruments such as the EU AI Act and the GDPR.\u003c/p\u003e \u003cp\u003eThese three families are not mutually exclusive: the most significant recent contributions tend to operate across boundaries, combining NLP with formal reasoning for interpretability, or embedding governance analysis within empirical studies of system performance. The triadic structure nonetheless offers a useful heuristic for mapping the literature and identifying under-served areas.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Method Families and Their Distribution\u003c/h2\u003e \u003cp\u003eAcross all three thematic families, the literature deploys methods from three broad methodological lineages: symbolic AI (logic, rules, argumentation, ontologies), statistical/ML AI (classical machine learning, deep learning, transformer language models), and hybrid AI (combinations that typically pair a statistical front-end for language understanding with a formal back-end for inference, constraint satisfaction, or audit logging). 
The distribution of these method families has shifted markedly across the period under review, with statistical and hybrid methods gaining ground while symbolic methods retain a strong presence in reasoning-centric and governance-centric research.\u003c/p\u003e \u003cp\u003e[Figure 3: 2D landscape map, Task family (text-centric \u0026harr; reasoning-centric) \u0026times; Model family (symbolic \u0026harr; statistical), with governance risk as a colour dimension\u0026mdash;to be rendered as a scatter/bubble plot in the final submission.]\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Methodological Approaches","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Symbolic and Formal Methods\u003c/h2\u003e \u003cp\u003eThe symbolic tradition in AI \u0026amp; Law rests on the insight that legal norms have a structure that can be captured by formal representations: they prescribe, prohibit, or permit actions; they apply conditionally; they can conflict; and they can be overridden by more specific or more recent norms. Deontic logic, defeasible reasoning, and argumentation frameworks are the primary tools for modelling this structure.\u003c/p\u003e \u003cp\u003eNormativity and Defeasibility: Legal systems are not monotonic\u0026mdash;new evidence, exceptions, and higher-level norms can defeat prima facie conclusions. Defeasible logics, in various forms, have been developed precisely to handle this feature. Compliance checking work by Robaldo et al. (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) demonstrates that propositional-level formalisms are often insufficient for realistic legal applications: compliance conditions apply to entire populations of individuals, making first-order representations necessary. 
Their comparative evaluation of multiple freely available reasoners\u0026mdash;including answer set programming (ASP) systems and OWL-based reasoners\u0026mdash;against a shared use case reveals a fundamental tension: reasoners with strong explainability properties tend to be computationally less efficient, while highly scalable systems such as ASP often lack built-in explanation mechanisms. This explainability-efficiency trade-off is one of the structural challenges the field has yet to fully resolve.\u003c/p\u003e \u003cp\u003eArgumentation Frameworks: Argumentation has long been recognised as particularly well-suited to legal reasoning because legal decisions characteristically involve weighing competing considerations rather than deriving conclusions from unambiguous premises. Walton, Sartor, and Macagno (\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) model statutory interpretation as a multi-scheme argumentation process that balances pro and contra arguments, providing both a logical formalisation of pro-tanto versus all-things-considered conclusions and tooling for visualisation and evaluation. Al-Abdulkarim, Atkinson, and Bench-Capon (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) develop a design methodology for case-based reasoning systems using Abstract Dialectical Frameworks (ADFs), drawing an illuminating analogy with entity-relationship modelling to emphasise the engineering discipline required in legal knowledge representation.\u003c/p\u003e \u003cp\u003eOntologies and Knowledge Representation: The Eunomos system (Boella et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) combines XML-structured legal documents with ontological modelling to provide a web-based knowledge management platform for legislative information. 
Francesconi (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2014\u003c/span\u003e) develops OWL Description Logic patterns for normative relations and integrates them with retrieval architectures, demonstrating that formal semantic representations can simultaneously support human browsing, formal querying, and inferential reasoning over legislation.\u003c/p\u003e \u003cp\u003eProbabilistic and Narrative Methods: Evidence in legal proceedings is characteristically uncertain and fragmentary. Vlek et al. (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) develop a methodology for explaining Bayesian network models of legal evidence using structured scenarios, addressing a critical weakness of probabilistic models: their opacity to lay decision-makers. By providing scenario-based narratives as interfaces to underlying probability calculations, the approach bridges formal inference and human comprehension.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Statistical and Machine Learning Methods\u003c/h2\u003e \u003cp\u003eThe statistical tradition in AI \u0026amp; Law gained momentum with the application of classical machine learning\u0026mdash;support vector machines, logistic regression, random forests\u0026mdash;to legal text classification tasks. The defining feature of this paradigm is that performance is measured empirically against labelled data rather than verified against formal specifications.\u003c/p\u003e \u003cp\u003eClassical Machine Learning: The CLAUDETTE system (Lippi et al., \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) exemplifies the classical ML approach applied to a practically significant legal problem: the automated detection of potentially unfair clauses in online Terms of Service agreements. 
Using supervised learning over a hand-annotated corpus, the system achieves competitive performance on clause classification and establishes a foundational benchmark for subsequent work. The choice of SVM-based models reflects the limited training data available\u0026mdash;a persistent constraint in legal NLP where expert annotation is expensive.\u003c/p\u003e \u003cp\u003eTransformer-Based Models: The arrival of BERT and its successors transformed legal NLP research with remarkable speed. Greco and Tagarelli (\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) provide a systematic survey of transformer-based language models applied to legal AI tasks, organising an extensive literature by task type and model architecture. Their survey is methodologically notable for its explicit attention to the relationship between model choice and task characteristics: not all legal NLP tasks benefit equally from pre-training on general corpora, and domain-specific pre-training on legal text (e.g., LegalBERT, CamemBERT trained on French legal corpora) yields consistent improvements on tasks requiring precise legal terminology or citation structure.\u003c/p\u003e \u003cp\u003eLarge Language Models and Zero-Shot Generalisation: The most recent cohort of large language models (LLMs)\u0026mdash;including GPT-4 class models and instruction-tuned variants of the LLaMA and Mistral families\u0026mdash;have introduced a new mode of system development: zero-shot or few-shot prompting without task-specific fine-tuning. Sovrano et al. (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) investigate zero-shot legal question answering on European legislation using discourse-based retrieval selection (DiscoLQA), demonstrating that careful retrieval design\u0026mdash;exploiting the discourse structure of EU legislative texts\u0026mdash;can substantially reduce hallucination and improve answer precision even without fine-tuning. 
Their work illustrates a general principle: for LLM-based legal applications, the quality of the retrieval and grounding pipeline often matters more than raw model scale.\u003c/p\u003e \u003cp\u003eCourt Decision Prediction: A significant sub-literature has developed around the prediction of judicial outcomes. Medvedeva, Wieling, and Vols (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) make an important methodological intervention, arguing that 'court decision prediction' conflates at least three distinct tasks that make very different demands on models and that have very different implications for legal practice: outcome identification (labelling an existing decision's outcome), judgement categorisation (classifying decisions by legal issue or ground), and outcome forecasting (predicting the outcome of a pending case). Their taxonomy has direct implications for benchmark design: datasets and evaluation metrics appropriate for one task may be systematically misleading for another. Benedetto et al. (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) extend decision support in this domain by combining judgment prediction with sentence-level explanation, using legal named entity recognition masking and entity-aware transformer architectures to generate predictions that are traceable to specific evidentiary or normative elements of the case.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Hybrid and Explainability-Oriented Methods\u003c/h2\u003e \u003cp\u003eHybrid approaches that combine statistical and symbolic components have attracted growing interest as the limitations of purely neural systems for legal applications have become apparent. 
The central motivation is that legal decisions must be justifiable\u0026mdash;they must be explainable in terms of legal norms, precedents, and facts\u0026mdash;not merely accurate by some external criterion.\u003c/p\u003e \u003cp\u003eRationale-Based Explanation: Ruggeri et al. (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) develop a memory-augmented neural network architecture for detecting and explaining unfair clauses in consumer contracts. The system produces not just a classification but a legal rationale: a set of sentences or phrases that provide a human-readable justification for the prediction. Their evaluation demonstrates that rationale-based models can improve both predictive accuracy and explanation quality, and that users find rationale-equipped systems significantly more useful than those providing only labels.\u003c/p\u003e \u003cp\u003eXAI and the Black Box Problem: Brożek et al. (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) provide a philosophically rigorous analysis of the 'black box problem' as it manifests in legal AI applications. Rather than treating the black box problem as a single issue, they decompose it into four analytically distinct components: opacity (the model's internal workings are not accessible), strangeness (the model's decision-making process does not resemble recognisable human reasoning), unpredictability (the model's outputs cannot be reliably anticipated), and the justification gap (the model cannot provide a legally adequate rationale). This four-part analysis is valuable for researchers because it clarifies which component of the black box problem a given XAI technique actually addresses\u0026mdash;and which it leaves open.\u003c/p\u003e \u003cp\u003eHybrid Architectures for Compliance: The compliance checking literature exemplifies the hybrid approach at the architectural level. 
A promising direction combines LLM-based semantic extraction\u0026mdash;identifying the relevant facts and norms in natural language documents\u0026mdash;with formal constraint solvers that check compliance, detect conflicts, and generate audit logs. Robaldo et al.'s (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) comparative evaluation suggests that this architecture is technically feasible but that current technology fragmentation\u0026mdash;with multiple incompatible formalisms and tools\u0026mdash;is a significant obstacle to production deployment.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison of Methodological Families in AI \u0026amp; Law Research\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003e Method Family\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRepresentation\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLegal Strengths\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eWeaknesses/Risks\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eKey AIL 
Example\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRule/logic-based (symbolic)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeontic rules, defeasibility, constraints\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTransparent inference; explicit exceptions; compliance/norm conflicts\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eKnowledge engineering cost; scalability gaps\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCompliance reasoner comparison on first-order use case (Robaldo et al., \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eArgumentation-based (symbolic)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eArgument schemes, pro/contra, acceptability\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFits legal justification; interpretive and evidential reasoning\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFormalisation choices affect outcomes; domain-specific evaluation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eStatutory interpretation as argumentation schemes (Walton et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2016\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCase-based reasoning / precedent\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFactors, issues/values, case graphs\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMatches precedent dynamics and analogical reasoning\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFactor representation is labour-intensive; limited cross-domain transfer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eADF methodology for factor-based cases (Al-Abdulkarim et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2016\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eProbabilistic\u0026thinsp;+\u0026thinsp;narrative\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBayesian networks\u0026thinsp;+\u0026thinsp;scenarios\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRigorous in evidential reasoning; scenario narratives aid comprehension\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eModelling expertise required; risk of model outsourcing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eExplaining Bayesian networks via scenario schemes (Vlek et al., \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2016\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClassical ML (SVM, etc.)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFeature space\u0026thinsp;+\u0026thinsp;labels\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRobust with limited data; useful baseline for clause detection\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLimited semantic generalisation; shallow explanation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003eCLAUDETTE: ML detection of unfair clauses (Lippi et al., \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2019\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeep learning / transformers\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePre-training\u0026thinsp;+\u0026thinsp;fine-tuning / in-context\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eState-of-the-art on most legal NLP tasks; scalable\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHallucination; justification gap; domain shift; governance demands\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSystematic transformer review for LegalAI (Greco \u0026amp; Tagarelli, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXAI / rationale-based\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRationales, sentence-level explanation, attribution\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eIncreases acceptability and auditability; connects to legal motivation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eExplanation may be cosmetic; requires human evaluation for validation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMemory networks for ToS explanation (Ruggeri et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2022\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKR / ontologies and semantic web\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRDF/OWL, XML, Hohfeld relations\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eStructural interoperability; formal querying; normative relation modelling\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOntology engineering cost; mismatch with raw natural language; maintenance burden when the law changes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eOWL-DL patterns for normative retrieval (Francesconi, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2014\u003c/span\u003e); Eunomos (Boella et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2016\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5. Data, Benchmarks, and Evaluation","content":"\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e5.1 Why Legal Data is Different\u003c/h2\u003e \u003cp\u003eAI systems for legal applications work almost exclusively with language\u0026mdash;statutes, judgments, contracts, administrative decisions\u0026mdash;and with institutional context: procedural rules, jurisdictional constraints, norm hierarchies. These properties make the construction and use of legal datasets substantially more complex than in most general-purpose NLP applications.\u003c/p\u003e \u003cp\u003eFirst, labels in legal datasets are characteristically interpretive. Whether a clause in a contract is 'unfair,' whether a judicial outcome is 'predictable,' or whether a piece of text 'entails' a legal conclusion are questions that often admit of principled disagreement among experts\u0026mdash;and that disagreement is not noise but signal about the inherent interpretive structure of law. 
Several contributions in the AI \u0026amp; Law literature explicitly model annotator disagreement as a substantive finding rather than a problem to be minimised (Habernal et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eSecond, legal datasets are typically jurisdiction-specific and language-specific in ways that reduce their generalisability. A dataset of European Court of Human Rights decisions is not straightforwardly applicable to common law contexts; a dataset of Dutch case law raises different challenges from one based on US federal contracts. This jurisdictional fragmentation is one of the most significant structural barriers to cumulative progress in the field.\u003c/p\u003e \u003cp\u003eThird, evaluation in legal AI frequently requires domain experts whose time is scarce and expensive. Standard automated metrics\u0026mdash;accuracy, F1, exact match\u0026mdash;may poorly capture legally relevant distinctions. A system that correctly predicts the outcome of a judgment but cannot produce a legally adequate rationale is of limited practical value; conversely, a system with lower predictive accuracy but richer explanatory output may be more useful to practitioners.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e5.2 Key Benchmarks and Datasets\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e summarises the principal benchmark datasets and evaluation resources that have shaped empirical research in the period under review.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eKey Benchmarks and Datasets in AI \u0026amp; Law Research (2011\u0026ndash;2026)\u003c/p\u003e 
\u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDataset / Benchmark\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTask Type\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDomain / Jurisdiction\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLang.\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eScale (indicative)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTypical Metrics\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003ePrimary Source\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLexGLUE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNLU suite (multi-task)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDiverse legal NLU tasks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c5\"\u003e \u003cp\u003eBundle of datasets; standardised evaluation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eAcc / F1 (task-dependent)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eChalkidis et al. (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2022\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLegalBench\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLegal reasoning for LLMs\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBroad (162 tasks; 6 reasoning types)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e162 tasks; multi-LLM evaluation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTask-specific; Acc / F1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eGuha et al. 
(\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCUAD\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eContract clause review (span/label)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCommercial contracts (EDGAR)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e13k+ annotations; 510 contracts\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eF1 / EM span scores; per-label Acc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eHendrycks et al. (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2021\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCOLIEE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLegal IR\u0026thinsp;+\u0026thinsp;entailment/QA competition\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCase law\u0026thinsp;+\u0026thinsp;statute law (JP/CA)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAnnual; multiple tasks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eMicro-F1 (retrieval); Acc per task\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eCOLIEE overview (10th edition)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eECHR Argument Mining Corpus\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eArgument mining\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEuropean Court of Human Rights\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e373 decisions; 2.3M tokens; 15k argument spans\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eSpan-F1\u0026thinsp;+\u0026thinsp;expert review\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eHabernal et al. (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNL Case-Law Citation Prediction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClassification / prediction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDutch case law\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eJudgment-level dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eAcc / F1\u0026thinsp;+\u0026thinsp;error analysis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eSchepers et al. 
(\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCLAUDETTE / ToS Unfair Clauses\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClause classification\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eConsumer law / Terms of Service\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEN\u0026thinsp;+\u0026thinsp;multilingual\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMultiple clause categories\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eAcc / F1; explanation evaluation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLippi et al. (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2019\u003c/span\u003e); Ruggeri et al. (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2022\u003c/span\u003e); Galassi et al. 
(\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2025\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFOIA Deliberative Language\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSensitive text detection\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eOpen-records / FOIA (US federal)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNewly annotated training set; operational tests\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePrecision / Recall\u0026thinsp;+\u0026thinsp;usability\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eBranting et al. (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2025\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e5.3 Evaluation Metrics and Their Limitations\u003c/h2\u003e \u003cp\u003eStandard NLP metrics\u0026mdash;accuracy, macro/micro F1, exact match, mean average precision\u0026mdash;are necessary but not sufficient for evaluating legal AI systems. Several systematic limitations recur across the literature.\u003c/p\u003e \u003cp\u003eAsymmetric error costs: In many legal applications, false negatives and false positives carry very different costs. A false negative in sensitive document redaction (a disclosed deliberative communication) may have serious legal consequences; a false positive (an unnecessarily redacted document) may frustrate transparency obligations. Standard F1 treats these symmetrically. 
Calibrated threshold selection and cost-sensitive evaluation are thus important complements to aggregate performance metrics.\u003c/p\u003e \u003cp\u003eJurisdiction and domain shift: Models trained on US contracts may perform poorly on EU equivalents due to differences in legal terminology, contractual structure, and applicable law. Evaluation protocols that report only in-domain test performance can systematically overstate practical utility. Cross-jurisdiction and cross-domain evaluation splits, where feasible, provide more informative performance estimates.\u003c/p\u003e \u003cp\u003eLabel disagreement: Multiple AI \u0026amp; Law benchmarks involve inherently contestable labels. The ECHR argument mining corpus (Habernal et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) reports inter-annotator agreement statistics alongside model performance and uses expert review as an additional evaluation layer\u0026mdash;a methodological standard that should be widely adopted.\u003c/p\u003e \u003cp\u003eThe justification gap: Perhaps the most fundamental evaluation challenge in legal AI is the absence of standardised protocols for assessing the adequacy of system-generated explanations and justifications. Brożek et al. (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) frame this as the 'justification' component of the black box problem: even a system with high predictive accuracy and locally faithful post-hoc explanations may fail to provide a justification that would be acceptable in a legal proceeding. Developing 'justification benchmarks'\u0026mdash;evaluation tasks that require traceable arguments, normative citations, and structural reasoning\u0026mdash;is one of the most important open methodological challenges in the field.\u003c/p\u003e \u003c/div\u003e"},{"header":"6. 
Application Domains and Case Studies","content":"\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e6.1 Contract Analysis and Consumer Protection\u003c/h2\u003e \u003cp\u003eThe automated analysis of legal contracts\u0026mdash;particularly for detecting unfair, unusual, or high-risk clauses\u0026mdash;has emerged as one of the most productive application areas in AI \u0026amp; Law, combining clear practical value with tractable technical problem definitions.\u003c/p\u003e \u003cp\u003eThe CLAUDETTE project (Lippi et al., \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) established the foundational framework: a supervised machine learning system trained on annotated Terms of Service documents to detect clauses belonging to potentially unfair categories defined by the EU's Unfair Contract Terms Directive. The system demonstrated that ML-based clause detection was technically feasible and practically useful, and it established a benchmark dataset that has supported cumulative research.\u003c/p\u003e \u003cp\u003eThe subsequent work by Ruggeri et al. (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) addressed a critical limitation of the original system: the absence of explanations for its predictions. Using memory-augmented neural networks that learn to attend to relevant 'legal rationale' passages, the extended system produces both a classification and a set of supporting text spans that justify it. Human evaluation confirmed that rationale-equipped systems are substantially more useful to non-expert users than those providing only labels, validating the investment in explanation infrastructure.\u003c/p\u003e \u003cp\u003eGalassi et al. (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) extend the CLAUDETTE framework to a multilingual EU context, comparing approaches for detecting unfair clauses in Terms of Service documents across multiple European languages. 
Their findings confirm that multilingual robustness is non-trivial\u0026mdash;models trained primarily on English-language data show substantial performance degradation on less-resourced EU languages\u0026mdash;and that cross-lingual transfer requires careful attention to both model architecture and training data distribution.\u003c/p\u003e \u003cp\u003eFrom a socio-technical perspective, contract AI is not merely a technical problem. Consumer contracts are sites of power asymmetry: the companies that draft Terms of Service have legal teams; the consumers who agree to them typically do not. AI-based detection systems can support regulators and civil society organisations in identifying systematic unfairness at scale. However, deployment requires careful institutional design: who has access to the system, how false positive rates interact with regulatory burden, and how detection results feed into enforcement processes.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e6.2 E-Discovery, Open Records, and Sensitive Text Management\u003c/h2\u003e \u003cp\u003eElectronic discovery and open-records management represent a second major application cluster, characterised by the need to classify large volumes of documents for legal sensitivity or privileged status under time pressure and with significant legal consequences for errors.\u003c/p\u003e \u003cp\u003eBranting et al. (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) describe a decision-support system for US Freedom of Information Act (FOIA) requests, designed to assist federal agency reviewers in identifying 'deliberative process' text\u0026mdash;internal communications about policy development that are subject to a FOIA exemption. 
The system was developed using a newly annotated training dataset, evaluated against domain expert annotations, and tested in an operational deployment context with federal agency staff.\u003c/p\u003e \u003cp\u003eMethodologically, the FOIA paper is exemplary for its integration of technical evaluation (precision/recall on the annotation task) with human-centred evaluation (usability studies examining time-to-decision, error rates, and reviewer confidence). It demonstrates that user interface design and workflow integration are not merely engineering considerations but substantive research contributions: a technically capable system that is not usable by its intended operators fails as a legal tool, regardless of its benchmark performance. The 'human-in-the-loop' evaluation framework used in this work should be regarded as a methodological standard for high-stakes legal AI applications.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e6.3 Compliance Checking and Normative Reasoning\u003c/h2\u003e \u003cp\u003eCompliance checking\u0026mdash;determining whether a system of actions, contracts, or business processes satisfies a set of legal norms\u0026mdash;is perhaps the application domain that most directly showcases the strengths and limitations of formal AI methods.\u003c/p\u003e \u003cp\u003eRobaldo et al. (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) conduct a comparative evaluation of multiple freely available reasoning technologies on a compliance checking use case involving first-order knowledge and compensatory norms. 
Their evaluation is methodologically important because it treats the choice of reasoning technology as an open research question rather than a presupposition: different formalisms\u0026mdash;including OWL-based reasoners, ASP solvers, and Datalog variants\u0026mdash;make different trade-offs between expressiveness, computational efficiency, and explanation quality.\u003c/p\u003e \u003cp\u003eThe paper's central finding\u0026mdash;that explainability and efficiency are in tension, with ASP systems offering better scalability but limited explanation facilities while OWL-based systems provide richer explanations at the cost of performance\u0026mdash;articulates a structural constraint that hybrid architectures must navigate. An important implication is that compliance checking systems cannot be evaluated on technical correctness alone: the quality and accessibility of the explanations they generate are equally important criteria for legal deployability.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec23\" class=\"Section2\"\u003e \u003ch2\u003e6.4 Legal Interpretation, Argumentation, and Precedent\u003c/h2\u003e \u003cp\u003eLegal interpretation\u0026mdash;determining what a statutory or regulatory text means in a specific factual context\u0026mdash;is perhaps the most intellectually demanding task in legal practice, and arguably the one least amenable to purely statistical approaches.\u003c/p\u003e \u003cp\u003eWalton, Sartor, and Macagno (\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) model statutory interpretation as a multi-scheme argumentation process in which competing interpretive arguments\u0026mdash;grammatical, systematic, teleological, and analogical\u0026mdash;are weighed against each other using argument schemes from rhetoric and informal logic. 
Their formalisation provides both a theoretical framework for understanding interpretation debates and a computational architecture that supports tooling for visualisation and evaluation. The integration of pro-tanto and all-things-considered reasoning provides a nuanced treatment of how interpretive conclusions can be qualified by context.\u003c/p\u003e \u003cp\u003eFor precedent reasoning, Al-Abdulkarim, Atkinson, and Bench-Capon (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) develop a systematic methodology for designing case-based reasoning systems using Abstract Dialectical Frameworks. Their approach treats the design of a legal reasoning system as an engineering discipline with explicit design choices: which factors are modelled, how issues and values are related, and how the resulting system handles novel cases that do not map cleanly onto existing precedents. The analogy with entity-relationship modelling in database design is apt: legal knowledge representation requires the same combination of domain expertise, representational discipline, and iterative refinement that complex data modelling demands.\u003c/p\u003e \u003cp\u003eHabernal et al. (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) approach legal argumentation from the NLP direction, constructing a large annotated corpus of argument spans from European Court of Human Rights decisions. Their annotation schema is explicitly grounded in legal theory\u0026mdash;distinguishing conclusion, premise, and epistemic support in legally meaningful ways\u0026mdash;and their evaluation combines automated metrics with expert review. 
The paper establishes both a benchmark and a methodology that other argument mining researchers can adopt.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003e6.5 Adjudication Support: Prediction, Explanation, and Citation\u003c/h2\u003e \u003cp\u003eResearch on AI support for judicial decision-making has grown substantially in the review period, fuelled both by increasing availability of digitised case law and by commercial interest in litigation prediction tools.\u003c/p\u003e \u003cp\u003eAs noted above, Medvedeva, Wieling, and Vols (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) provide an essential methodological corrective by distinguishing three distinct tasks that are routinely conflated under the 'court decision prediction' label. Their taxonomy\u0026mdash;outcome identification, judgement categorisation, and outcome forecasting\u0026mdash;has significant implications for how benchmarks should be designed and how results should be interpreted. Outcome identification (labelling a known decision) and outcome forecasting (predicting an unknown pending decision) are not merely different in difficulty; they make fundamentally different demands on models and have entirely different practical implications.\u003c/p\u003e \u003cp\u003eBenedetto et al. (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) address the explanation gap in prediction-oriented research by combining judgment prediction with sentence-level explanation. Their entity-aware transformer architecture, which leverages legal named entity recognition to mask and attend to legally significant entities, produces predictions that are traceable to specific passages in the judgment. The approach also addresses privacy concerns by allowing selective anonymisation of legally sensitive entities.\u003c/p\u003e \u003cp\u003eSchepers et al. 
(\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) investigate a distinct but practically important question in the Dutch legal context: given the increasing volume of published case law, can NLP models predict whether a judgment will subsequently be cited by other courts? Citation prediction is proposed as a proxy for 'legal authority'\u0026mdash;a judgment that is frequently cited by other courts has, in some sense, made law. The paper combines careful NLP methodology with legal domain analysis, situating citation patterns within the institutional structure of the Dutch court system.\u003c/p\u003e \u003c/div\u003e"},{"header":"7. Ethical, Regulatory, and Socio-Technical Challenges","content":"\u003cdiv id=\"Sec26\" class=\"Section2\"\u003e \u003ch2\u003e7.1 The EU AI Act and Its Implications for AI \u0026amp; Law Research\u003c/h2\u003e \u003cp\u003eThe EU Artificial Intelligence Act (Regulation (EU) 2024/1689), which entered into force on 1 August 2024, represents the most significant regulatory development for AI \u0026amp; Law research since the field's founding. Its risk-based framework classifies AI systems into prohibited, high-risk, and limited-risk categories, with 'administration of justice and democratic processes' explicitly listed as a high-risk domain.\u003c/p\u003e \u003cp\u003eFor AI systems deployed in this domain, the Act imposes obligations regarding data quality and governance, technical documentation, transparency and traceability, accuracy and robustness, and human oversight. These obligations are not merely compliance requirements for commercial vendors: they define a research agenda. Systems that cannot be documented, audited, or subjected to meaningful human oversight are, under the Act, legally non-deployable in high-risk legal contexts regardless of their technical performance. 
This creates a powerful institutional incentive for the kind of explainability, auditability, and formal verification research that the symbolic and hybrid traditions in AI \u0026amp; Law have long advocated.\u003c/p\u003e \u003cp\u003eThe phased implementation timeline is important for researchers to understand. The prohibitions on clearly unacceptable AI practices and AI literacy requirements entered into application in February 2025. Obligations for general-purpose AI (GPAI) models\u0026mdash;including foundation models used as components of legal AI systems\u0026mdash;became applicable in August 2025. The core obligations for high-risk AI systems will apply from August 2026, with application deferred to August 2027 for high-risk AI systems that are safety components of products covered by EU product-safety legislation (Article 6(1) and Annex I).\u003c/p\u003e \u003cp\u003eThe Act's transparency requirements are particularly relevant for the transformer-based systems that have become dominant in legal NLP. Systems that make automated decisions affecting individuals must be able to explain those decisions in terms that the affected persons can understand. Post-hoc explanations of the kind generated by SHAP or LIME-style methods may not satisfy this requirement if they cannot be connected to legally meaningful reasoning steps.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section2\"\u003e \u003ch2\u003e7.2 Fairness, Discrimination, and Sensitive Personal Data\u003c/h2\u003e \u003cp\u003eThe use of personal data in AI-based legal decision-support raises both GDPR compliance issues and substantive fairness concerns. Žliobaitė and Custers (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) surface a non-obvious tension: in some circumstances, using sensitive personal data (such as race or gender) in a decision model may be necessary to detect and correct for existing discrimination in training data. 
Excluding these variables without understanding the causal structure of the data generating process can preserve or amplify historical biases under a superficial appearance of fairness.\u003c/p\u003e \u003cp\u003eThe Dutch Data Protection Authority (Autoriteit Persoonsgegevens) has emphasised that all algorithmic processing of personal data must comply with GDPR requirements, including purpose limitation, data minimisation, transparency, and rights of data subjects. For legal AI researchers, this means that dataset construction, model training, and system deployment must be embedded in a privacy-by-design framework that goes beyond standard anonymisation procedures.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec28\" class=\"Section2\"\u003e \u003ch2\u003e7.3 Human Oversight and the Automation Paradox\u003c/h2\u003e \u003cp\u003eA recurring theme across AI \u0026amp; Law applications is the tension between efficiency gains from automation and the legal requirements for human accountability. Legal decisions that affect individuals must, in most jurisdictions, be made by or under the meaningful oversight of a responsible human decision-maker. AI systems that are too complex for their human operators to understand or monitor effectively may undermine this requirement even when they are formally embedded in human-in-the-loop workflows.\u003c/p\u003e \u003cp\u003eThe FOIA assistant evaluated by Branting et al. (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) illustrates this challenge: their usability studies reveal that reviewers often find it difficult to critically evaluate or override system recommendations for documents they have not independently reviewed. 
This 'automation paradox'\u0026mdash;where increasing system capability reduces the effective oversight that human operators exercise\u0026mdash;is a fundamental socio-technical challenge for high-stakes legal AI deployment, and one that purely technical research cannot resolve.\u003c/p\u003e \u003c/div\u003e"},{"header":"8. Research Agenda and Open Challenges","content":"\u003cp\u003eOur synthesis identifies five priority areas for AI \u0026amp; Law research in the coming period.\u003c/p\u003e \u003cdiv id=\"Sec30\" class=\"Section2\"\u003e \u003ch2\u003e8.1 From Accuracy to Legal Validity: Justification-Oriented Benchmarks\u003c/h2\u003e \u003cp\u003eCurrent legal AI benchmarks primarily reward correct predictions against labelled data. What they systematically fail to capture is whether a system's output is legally valid\u0026mdash;whether it respects the hierarchy of norms, correctly applies exceptions, provides adequate justification, and exhibits procedural fairness. We propose the development of 'justification benchmarks' that require systems to produce not just predictions but traceable justifications: citations to relevant norms and precedents, explicit handling of exceptions and conflicts, and structured arguments that follow legally recognised inference patterns. Such benchmarks would require collaboration between computational researchers and legal domain experts and would constitute a major methodological contribution to the field.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec31\" class=\"Section2\"\u003e \u003ch2\u003e8.2 Hybrid LLM-Plus-Formal-Constraint Architectures\u003c/h2\u003e \u003cp\u003eThe evidence reviewed suggests that neither purely neural nor purely symbolic approaches are sufficient for the full range of legal AI tasks. LLMs excel at natural language understanding and generation but lack reliable mechanisms for normative consistency or audit trail generation. 
Formal reasoners excel at constraint satisfaction and explanation generation but struggle with the ambiguity and variability of natural legal language. The most promising architectural direction exploits this complementarity: LLMs as semantic front-ends for information extraction and hypothesis generation, coupled with formal constraint engines (ASP, SHACL, OWL-based reasoners, or normative constraint languages) for final inference, conflict detection, and audit logging. Realising this architecture requires advances in semantic parsing, uncertainty representation, and interoperability between NLP and formal reasoning components.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec32\" class=\"Section2\"\u003e \u003ch2\u003e8.3 Multilingual Benchmarks and Cross-Jurisdiction Transfer\u003c/h2\u003e \u003cp\u003eEU legal contexts require multilingual robustness across the EU's 24 official languages, yet the benchmark infrastructure for non-English legal NLP remains sparse. Cross-jurisdiction transfer between common law and civil law systems raises additional challenges related to different legal concepts, institutional structures, and interpretive traditions. 
A concrete research agenda would develop multilingual benchmarks with explicit juridical-comparative label definitions\u0026mdash;i.e., annotation guidelines that account for systematic differences in legal concepts across jurisdictions\u0026mdash;and would incorporate annotator disagreement modelling as a first-class component of the evaluation framework.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec33\" class=\"Section2\"\u003e \u003ch2\u003e8.4 Human-Centred Evaluation as a First-Class Method\u003c/h2\u003e \u003cp\u003eThe FOIA assistant work and the rationale-explanation literature both suggest that human-centred evaluation\u0026mdash;measuring actual impact on decision quality, workload, accuracy, and appropriate trust calibration\u0026mdash;is at least as important as automated metric performance for legal AI systems. We recommend that the field develop standardised protocols for human-in-the-loop evaluation of legal AI systems, including pre-registered study designs, standardised outcome measures (time-to-decision, error rate, correction behaviour, trust and confidence calibration), and guidelines for meaningful usability testing with domain experts.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec34\" class=\"Section2\"\u003e \u003ch2\u003e8.5 Open-Texture Detection and Regulatory Interpretation\u003c/h2\u003e \u003cp\u003eOpen texture\u0026mdash;the phenomenon by which apparently clear legal terms acquire indeterminate applications at the margins\u0026mdash;is a fundamental feature of law that any system intended to support regulatory interpretation must handle. Several AI \u0026amp; Law contributions have proposed that detecting open-texture terms is a necessary prerequisite for reliable regulatory automation. 
LLMs may offer new capabilities in this area, given their ability to generate multiple candidate interpretations, but evaluation requires legally grounded annotation: which terms are 'open texture,' in what contexts, and with what degree of interpretive controversy. A dataset of open-texture terms annotated by legal experts, with accompanying explanation requirements, would be a valuable community resource.\u003c/p\u003e \u003c/div\u003e"},{"header":"9. Conclusion","content":"\u003cp\u003eThis systematic scoping review has mapped fifteen years of research at the intersection of Artificial Intelligence and Law, from 2011 to 2026. The overarching narrative is one of productive but unresolved tension between two paradigms: the formal, symbol-based tradition that prizes explicability, normative correctness, and procedural legitimacy; and the statistical, data-driven tradition that prizes empirical performance, scalability, and practical usability.\u003c/p\u003e \u003cp\u003eThe period under review has seen remarkable progress on both fronts. Symbolic methods have become more computationally tractable, more amenable to integration with ontological and knowledge graph infrastructure, and more sophisticated in their treatment of argumentation, defeasibility, and uncertainty. Statistical methods have been transformed by the pre-training paradigm, with transformer-based language models achieving human-competitive performance on many legal NLP benchmarks. Hybrid architectures are emerging that seek to combine the complementary strengths of both traditions.\u003c/p\u003e \u003cp\u003eAt the same time, several structural challenges persist. The gap between predictive accuracy and legal justifiability remains wide and is not bridged by current XAI techniques. Benchmark infrastructure, though greatly improved, remains concentrated in English-language, common law contexts and does not adequately capture the juridical validity of system outputs. 
Human-centred evaluation is underused relative to its importance for high-stakes legal applications. And the regulatory context\u0026mdash;particularly the EU AI Act's explicit requirements for high-risk AI systems in the justice domain\u0026mdash;is reshaping the design space for legal AI in ways that amplify the need for exactly the formal verification, documentation, and human oversight capabilities that the field's founding tradition championed.\u003c/p\u003e \u003cp\u003eThe research agenda we have proposed\u0026mdash;justification benchmarks, hybrid architectures, multilingual transfer, human-centred protocols, and open-texture detection\u0026mdash;is ambitious but not unrealistic. Crucially, it requires collaboration between computational researchers, legal scholars, and practitioners that goes beyond the surface-level interdisciplinarity of importing legal data into NLP pipelines. Addressing the deepest challenges of AI \u0026amp; Law requires bringing legal reasoning\u0026mdash;in its full interpretive, argumentative, and institutional complexity\u0026mdash;to the centre of the technical research agenda.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eConflict of Interest\u003c/strong\u003e: The author declares no conflicts of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e: No new datasets were generated or analysed for this review. 
All datasets referenced are described in the relevant primary sources cited in the reference list.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI Tool Usage Disclosure\u003c/strong\u003e: Consistent with Springer policy, large language model assistance was used for editing support; the intellectual content, analysis, synthesis, and all substantive claims are solely the work of the named author.\u003c/p\u003e\n\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cul\u003e\u003cli\u003eA systematic scoping review (2011\u0026ndash;Feb 2026) of AI \u0026amp; Law research following a PRISMA-ScR\u0026ndash;informed protocol\u003c/li\u003e\u003cli\u003eA triadic taxonomy of the field (text-centric / reasoning-centric / governance-centric) and mapping of methods (symbolic / statistical / hybrid), benchmarks, and application domains\u003c/li\u003e\u003cli\u003eA synthesis arguing the field has moved from \u0026ldquo;AI or Law\u0026rdquo; toward hybrid socio-technical systems balancing legal justification/traceability with empirical performance/reproducibility\u003c/li\u003e\u003cli\u003eA research agenda with five open challenges (justification benchmarks; LLM+formal constraints; multilingual/cross-jurisdiction transfer; human-centred evaluation; open-texture detection)\u003c/li\u003e\u003c/ul\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAl-Abdulkarim L, Atkinson K, Bench-Capon T (2016) A methodology for designing systems to reason with legal cases using Abstract Dialectical Frameworks. Artif Intell Law 24:1\u0026ndash;49\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBenedetto I, Koudounas A, Vaiani L et al (2025) Boosting court judgment prediction and explanation using legal entities. Artif Intell Law 33:605\u0026ndash;640\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBoella G, Di Caro L, Humphreys L et al (2016) Eunomos, a legal document and knowledge management system for the Web to provide relevant, reliable and up-to-date information on the law. 
Artif Intell Law 24:245\u0026ndash;283\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBranting K, Brown B, Giannella C et al (2025) Decision support for detecting sensitive text in government records. Artif Intell Law 33:171\u0026ndash;197\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBrożek B, Furman M, Jakubiec M, Kucharzyk B (2024) The black box problem revisited. Real and imaginary challenges for automated legal decision making. Artif Intell Law 32:427\u0026ndash;440\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChalkidis I, Jana A, Hartung D, Bommarito M, Androutsopoulos I, Katz DM, Aletras N (2022) LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. Proceedings of ACL 2022\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEuropean Commission (2024) AI Act enters into force, 1 August 2024\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEuropean Union (2024) Regulation (EU) 2024/1689 of the European Parliament and of the Council (Artificial Intelligence Act). Official Journal of the European Union\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFrancesconi E (2014) A description logic framework for advanced accessing and reasoning over normative provisions. Artif Intell Law 22:291\u0026ndash;311\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGalassi A, Lagioia F, Jabłonowska A et al (2025) Unfair clause detection in terms of service across multiple languages. Artif Intell Law 33:641\u0026ndash;689\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGrant MJ, Booth A (2009) A typology of reviews: an analysis of 14 review types and associated methodologies. Health Inf Libr J 26(2):91\u0026ndash;108\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGreco CM, Tagarelli A (2024) Bringing order into the realm of Transformer-based language models for artificial intelligence and law. 
Artif Intell Law 32:863\u0026ndash;1010\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGuha N, Nyarko J, Ho DE et al (2023) LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Proceedings of NeurIPS 2023 (Datasets and Benchmarks Track)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHabernal I, Faber D, Recchia N et al (2024) Mining legal arguments in court decisions. Artif Intell Law 32:1\u0026ndash;38\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHendrycks D, Burns C, Chen A, Ball S (2021) CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review. Proceedings of NeurIPS 2021 (Datasets and Benchmarks Track)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLippi M, Pałka P, Contissa G et al (2019) CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artif Intell Law 27:117\u0026ndash;139\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMedvedeva M, Wieling M, Vols M (2023) Rethinking the field of automatic prediction of court decisions. Artif Intell Law 31:195\u0026ndash;212\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePage MJ, McKenzie JE, Bossuyt PM et al (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372:n71\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRobaldo L, Batsakis S, Calegari R et al (2023) Compliance checking on first-order knowledge with conflicting and compensatory norms: a comparison among currently available technologies. Artif Intell Law\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRuggeri F, Lagioia F, Lippi M, Torroni P (2022) Detecting and explaining unfairness in consumer contracts through memory networks. 
Artif Intell Law 30:59\u0026ndash;92\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSchepers I, Medvedeva M, Bruijn M, Wieling M, Vols M (2024) Predicting citations in Dutch case law with natural language processing. Artif Intell Law 32:807\u0026ndash;837\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSergot MJ, Sadri F, Kowalski RA, Kriwaczek F, Hammond P, Cory HT (1986) The British Nationality Act as a logic program. Commun ACM 29(5):370\u0026ndash;386\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSovrano F, Palmirani M, Sapienza S, Pistone V (2025) DiscoLQA: zero-shot discourse-based legal question answering on European Legislation. Artif Intell Law 33:323\u0026ndash;359\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTricco AC, Lillie E, Zarin W et al (2018) PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med 169(7):467\u0026ndash;473\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eVlek CS, Prakken H, Renooij S, Verheij B (2016) A method for explaining Bayesian networks for legal evidence with scenarios. Artif Intell Law\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWalton D, Sartor G, Macagno F (2016) An argumentation framework for contested cases of statutory interpretation. Artif Intell Law 24:51\u0026ndash;91\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eŽliobaitė I, Custers B (2016) Using sensitive personal data may be necessary for avoiding discrimination in data-driven decision models. 
Artif Intell Law 24:183\u0026ndash;201\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAutoriteit Persoonsgegevens (2025) Regels bij gebruik van AI \u0026amp; algoritmes (AVG-kader)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAutoriteit Persoonsgegevens (2026) Visie op generatieve AI: zonder duidelijke waarden dreigt 'Wilde Westen'\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRaad voor de rechtspraak (2019\u0026ndash;2025). Rechtstreeks (themanummer: Algoritmes in de rechtspraak); AI voor een rechtvaardige Rechtspraak (AI-strategie)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWetenschappelijke Raad voor het Regeringsbeleid (2021) Artifici\u0026euml;le Intelligentie \u0026ndash; adviesprojecten en documenten\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Artificial Intelligence and Law, legal NLP, normative reasoning, explainable AI, legal benchmarks, EU AI Act, argument mining, transformer models, compliance checking, scoping 
review","lastPublishedDoi":"10.21203/rs.3.rs-8913025/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8913025/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis systematic scoping review examines how research at the intersection of Artificial Intelligence and Law (AI \u0026amp; Law) has evolved over the fifteen-year period from 2011 to 2026. Following a PRISMA-ScR-informed protocol, we synthesise contributions published primarily in Artificial Intelligence and Law and related venues across two converging paradigms: (i) symbolic, argumentation-based, and formal models for legal knowledge representation, normative reasoning, and justification, and (ii) statistical, machine learning, and natural language processing (NLP) approaches that analyse, predict, and retrieve legal text at scale.\u003c/p\u003e \u003cp\u003eOur core finding is that the field has transitioned from a dichotomy of 'AI or Law' toward hybrid socio-technical systems in which formal guarantees\u0026mdash;normative consistency, traceability, and human oversight\u0026mdash;must coexist with empirical performance demands such as robust generalisation, reproducibility, and realistic task evaluation. Methodologically, a clear shift from relatively closed, domain-specific systems toward open benchmarks, open data, and open implementations is observable, particularly in legal NLP and legal information retrieval/entailment competitions. Yet a crucial distinction persists: the difference between 'predicting correctly' and 'reasoning legally.' 
Multiple contributions emphasise that predictive models without adequate explanation and justification frameworks remain legally and socially problematic.\u003c/p\u003e \u003cp\u003eWe operationalise a triadic taxonomy\u0026mdash;text-centric, reasoning-centric, and governance-centric\u0026mdash;and map representative works onto method families (symbolic, statistical, hybrid), datasets and benchmarks, and application domains (contract analysis, e-discovery, compliance checking, adjudication support, and argument mining). The EU AI Act's risk-based framework, with phased applicability through 2026\u0026ndash;2027, directly amplifies research questions around transparency, documentation, human oversight, and data quality. We conclude with a concrete research agenda identifying five open challenges: justification-oriented benchmark design, hybrid LLM-plus-formal-constraint architectures, multilingual and cross-jurisdiction transfer, human-centred evaluation protocols, and open-texture detection in regulatory text.\u003c/p\u003e","manuscriptTitle":"Artificial Intelligence and Law, 2011–2026: A Systematic Scoping Review of Methods, Benchmarks, and Open Challenges","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-20 04:59:34","doi":"10.21203/rs.3.rs-8913025/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"68586660-a35b-4eee-935c-c0e862c6baea","owner":[],"postedDate":"February 20th, 
2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-02-23T05:25:35+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-20 04:59:34","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8913025","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8913025","identity":"rs-8913025","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
