AI-Powered Defect Prediction: From Code Smells to Failure Forecasting

Research Article (Research Square preprint)

Md Mostafizur Rahman, Md Mostafijur Rahman, Maria Khatun Shuvra, and 2 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6792823/v1
This work is licensed under a CC BY 4.0 License
Status: Posted. Version 1 posted.

Abstract

Manual flaw discovery becomes even more insufficient as software systems grow in complexity. From early signs like code smells to late-stage system failures, this systematic study investigates the use of artificial intelligence (AI) and machine learning (ML) approaches to anticipate software problems across several phases of the software lifetime. 500 peer-reviewed papers released between 2013 and 2025 were examined for techniques, datasets, and assessment measures using PRISMA guidelines.
Important artificial intelligence models include Random Forest, SVM, deep learning architectures, and newer transformer models. The features applied span static metrics, process-based indicators, and textual data from code repositories. The review reveals a growing tendency toward hybrid models, multimodal features, and an emphasis on explainability and cross-project adaptability. Despite these advances, generalizability, interpretability, and dataset consistency remain difficult. The study highlights research gaps together with future prospects, including explainable artificial intelligence, real-time CI/CD integration, and human-in-the-loop systems for robust and proactive software quality assurance.

Keywords: AI-powered defect prediction, code smells, bug forecasting, failure prediction, machine learning, software quality, software metrics, static and dynamic analysis, deep learning, explainable AI

1. INTRODUCTION

The ways in which flaws are introduced into code, and the sheer number of flaws in software, are usually beyond the capacity and resources of most development teams. Defect prediction methods seek to find the software artifacts most likely to be faulty (1). By directing developers' attention to particular artifacts such as commits, methods, or classes, defect prediction mainly helps to lower the cost of testing, analysis, and code reviews (2). Because it directly helps to improve software quality, lower maintenance costs, and speed up development cycles, defect prediction is important in software engineering (3). Identifying components or modules likely to contain flaws before they are released helps developers allocate testing and debugging resources more wisely, reducing failures that can cause system crashes, security vulnerabilities, or performance problems (4).
Manual code inspection becomes unworkable in large-scale, sophisticated software systems, and undetected flaws can cause significant financial and reputational harm. Defect prediction models, especially those driven by artificial intelligence and machine learning, make possible the early and automatic detection of high-risk areas based on historical data, code metrics, or even developer behavior (5). This proactive technique shifts the emphasis from reactive debugging to preventive quality assurance, producing more dependable, maintainable, and secure software products. As software increasingly underpins important infrastructure and services, the capacity to forecast and prevent flaws is not only helpful but necessary to provide robust, high-performance systems (6).

Within software engineering, several terms are essential to understanding defect prediction (7). Code smells are structural characteristics of source code that, although not always defects themselves, indicate underlying design or implementation problems that may cause defects over time. Examples that undermine readability and maintainability include overly long methods, duplicated code, and classes that handle several responsibilities. Bugs, by contrast, are errors or weaknesses in the software that lead to unexpected or incorrect behavior; they often result from logical errors, erroneous assumptions, or unintended component interactions (8). Failures, which usually result from underlying defects, range from minor annoyances to major system crashes that compromise the intended operation of the system at runtime. Technical debt is the accumulated cost of poor design decisions and hasty fixes that prioritize short-term aims over long-term code quality, impeding future progress and complicating defect fixing. Taken together, these ideas provide the basis of defect prediction, since they capture the several causes and manifestations of software quality degradation (9).

1.1.
Rise of AI/ML in automating defect prediction: By bringing automated, data-driven solutions to a traditionally manual and error-prone process, the rise of artificial intelligence (AI) and machine learning (ML) has fundamentally changed the landscape of defect prediction in software engineering. Conventional flaw detection techniques such as static code analysis and human code reviews struggle to keep pace with the volume and complexity of modern codebases as software systems grow larger and more sophisticated (10–13). By learning from past software data, including code metrics, modification histories, and bug reports, AI/ML methods address these problems by identifying patterns and signals associated with flaws. Models including decision trees, support vector machines, neural networks, and, more recently, deep learning architectures have shown considerable capacity for accurately predicting problematic components. These tools improve early fault discovery and offer scalable, ongoing evaluation throughout the software development process (14–16). Moreover, developments in graph-based models and natural language processing have made it possible to analyze unstructured data, including dependency graphs and commit messages, thereby enhancing fault detection capabilities. Consequently, through intelligent automation, artificial intelligence and machine learning have become vital allies in achieving better software quality and dependability (17–21).

With an eye to bridging early-stage signs like code smells and end-stage manifestations like system failures, this systematic review covers the whole spectrum of AI-powered defect prediction. While most current research treats code anomalies, defects, and failures as isolated events, this review stresses their interconnection and evolution during the software lifecycle.
We aim to give a complete picture of defect development by looking at how early warning signs in the codebase, such as bad design patterns or maintainability issues, can escalate into functional flaws and finally operational breakdowns. From development to deployment, this integrated approach helps to identify predictive factors at several phases and assesses how artificial intelligence and machine learning methods might be applied to foresee problems at each level. In doing so, the study emphasizes not only the effectiveness of individual prediction models but also the need for thorough, multilayered methods that resolve flaws before they impact system dependability or performance.

Table 1 Research Questions
For this review, the research questions include:
1) Which artificial intelligence methods are applied to defect prediction?
2) How do software flaws relate to code smells?
3) Which common datasets, metrics, and tools exist?

2. Study Design

Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, this systematic review aims for scientific rigor, transparency, and reproducibility. The main goal was to fully assess and synthesize AI-driven solutions for defect prediction across the software lifecycle, including early-stage indicators (e.g., code smells), mid-stage anomalies (e.g., bugs), and late-stage operational difficulties (e.g., system failures).

2.1. Search Strategy: To ensure thorough coverage of peer-reviewed literature and high-quality research outputs, a multi-database search was carried out. Covering January 2013 through April 2025, the database searches included IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, Scopus, and Web of Science.
Search queries were constructed with Boolean operators and contained combinations of the following keywords: ("Artificial Intelligence" OR "AI" OR "Machine Learning" OR "Deep Learning") AND ("Failure Prediction" OR "Bug Forecasting" OR "Defect Prediction") AND ("Code Smells" OR "Software Quality" OR "Technical Debt"). For backward snowballing, we also examined the references of selected papers; for forward snowballing, we used citation tracking on Google Scholar.

Study                   Covered Areas
Zhao et al. (22)        2000–2021 (67 studies)
Pachouly et al. (23)    2010–2021 (146 studies)
Li et al. (24)          2014–2017 (64 studies)
Hosseini et al. (25)    –2015 (68 studies)
Li et al. (25)          2010–2018 (75 studies)
Batool and Khan (26)    2010–2021 (80 studies)

2.2. Exclusion and Inclusion Standards: Articles were chosen according to the following standards in order to guarantee relevance and quality.

Inclusion criteria:
1) Empirical investigations using artificial intelligence/machine learning to predict software flaws
2) Publications from 2013 to 2025
3) English-language publications only
4) Studies providing evaluation outcomes using publicly available or industry datasets
5) Coverage of at least one stage: code smells, bugs, vulnerabilities, or failure events

Exclusion criteria:
1) Papers without artificial intelligence or machine learning approaches
2) Non-peer-reviewed materials (such as blogs or white papers)
3) Publications in languages other than English
4) Review articles, position papers, or studies lacking empirical support
5) Studies concentrating exclusively on hardware or non-software-related defect identification

2.3. Screening and Selection Procedure: After duplicate elimination, 3,418 papers were initially retrieved. Two reviewers independently conducted title and abstract screening using Rayyan QCRI, followed by full-text screening. A third reviewer resolved conflicts. Following a thorough eligibility check, the final review comprised 500 items in all. Figure 1 shows the PRISMA flow diagram depicting this process; it will be included in the manuscript.

2.4.
Data Gathering: A structured data extraction form was developed to ensure consistency among reviewers. The following data were gathered for every selected article:

Bibliographic information: title, authors, year, venue
AI/ML techniques: algorithm type, model complexity, training technique used
Prediction scope: code smells, bugs, security flaws, runtime problems
Feature types: static metrics, dynamic execution logs, textual data, change history
Dataset: name, source (public or private), size, domain
Evaluation metrics: accuracy, precision, recall, F1-score, AUC-ROC, MCC
Validation strategy: k-fold CV, train/test split, cross-project validation
Tools and frameworks: Weka, Scikit-learn, TensorFlow, Code2Vec, SonarQube, etc.

2.5. Evaluation of Quality: Every study was assessed using a modified form of the Critical Appraisal Skills Program (CASP) checklist adapted for software engineering in order to evaluate methodological quality and reduce bias. The criteria included experimental rigor, dataset availability, repeatability, and algorithm transparency. Studies were scored as High, Medium, or Low quality; sensitivity analyses were carried out to evaluate how lower-quality studies would affect overall results.

2.6. Method of Data Synthesis: A narrative synthesis was used, given the variation among studies in approaches, datasets, and results. Studies were arranged thematically by model type (traditional ML versus deep learning) and by prediction focus (e.g., code smell detection, bug prediction, failure forecasting). Variations in experimental design and incompatible evaluation criteria prevented quantitative synthesis (meta-analysis). Descriptive statistics and qualitative coding helped to identify trends, popular methods, and research gaps.

3. Results

3.1. Defect Prediction: AI Models: The review found a wide spectrum of artificial intelligence and machine learning models used in the literature.
With 130 studies, Random Forest, which offers interpretability, robustness against overfitting, and ease of use, emerged as the most frequently used model. Support Vector Machines (SVM) and Neural Networks were not far behind, with 85 and 110 studies respectively. Sixty articles used traditional decision trees, mostly in older publications. Often used in failure prediction and log data analysis, deep learning models like CNNs and RNNs appeared in 45 and 30 studies respectively. Though relatively new in this field, transformers were used in 40 studies, showing encouraging results in natural language-based fault prediction from code comments and commit messages. The bar chart below shows this distribution, highlighting a move in recent years toward more complex models.

3.2. Defect Prediction Features Type: Most studies rely heavily on feature engineering. Applied in 220 papers, the most commonly used features were static metrics, including cyclomatic complexity, coupling, cohesion, and lines of code. Process metrics such as code churn, commit frequency, and developer experience were used in 160 papers. Textual features obtained from commit messages, bug reports, and inline comments were found in 100 studies. Only 50 studies used log data, mostly for failure prediction with temporal deep learning models. Notably, 120 studies used combined feature sets that merged textual, dynamic, and static data, reflecting a trend toward multi-modal learning. The following chart shows this distribution and suggests that richer feature sets could yield better prediction accuracy.

3.3. Validation Methodologies and Evaluation Metrics: Defect prediction models were evaluated for classification performance using several criteria. Used in 480 out of 500 studies, accuracy was the most often cited statistic.
Because they help to address class imbalance, F1-score and Precision were also common, reported in 410 and 370 studies respectively. Recall was emphasized in 360 studies, particularly in safety-critical applications where identifying all possible flaws is vital. AUC-ROC and the Matthews Correlation Coefficient (MCC) appeared in 290 and 120 studies respectively, with MCC gaining popularity for its balanced assessment of binary classification tasks. Most studies used k-fold cross-validation or train/test splits, with a minority using time-series validation for log-based forecasting; metrics were sometimes reported in combination. The chart below offers a graphic summary of metric use, highlighting the community's dependence on standard classification measures.

4. Taxonomy of AI-Based Defect Prediction

From early code anomalies to system breakdowns, AI-powered defect prediction encompasses several phases of the software development lifecycle. The taxonomy presented here is divided into three basic categories: Code Smell Detection, Bug and Vulnerability Prediction, and Failure Forecasting. As seen in the chart below, every category makes use of different artificial intelligence techniques suited to its detection objectives.

4.1. Code Smell Detection: Code smells are signs of bad design decisions that might not always result in bugs but create maintainability problems and can eventually lead to software defects. Typical forms are feature envy, god classes, and long methods. AI models including Transformers, Convolutional Neural Networks (CNNs), and Decision Trees have been used to identify these smells from static code metrics and source patterns. Though their scalability makes static approaches more common, both static analysis (source code scanning) and dynamic analysis (runtime behavior inspection) are applied.
Table 2 Code Smell Detection (Decision Tree)

    from sklearn.tree import DecisionTreeClassifier

    # Features per method: length, complexity, loops
    X = [[45, 10, 3], [10, 2, 1], [100, 20, 5]]
    y = [1, 0, 1]  # 1 = Smell, 0 = Clean

    model = DecisionTreeClassifier()
    model.fit(X, y)

    # Classify a previously unseen method
    new_method = [[60, 12, 2]]
    prediction = model.predict(new_method)
    print("Smelly" if prediction[0] else "Clean")

4.2. Bug and Vulnerability Prediction: This category is dedicated to spotting potentially defective code sections before they cause runtime problems. Commonly employed artificial intelligence models include Extreme Gradient Boosting (XGBoost), Random Forests, and Support Vector Machines (SVM). These models are trained on features taken from historical software repositories, including commit messages, change frequency, and developer behavior. Deep learning models are increasingly used because they can learn from unstructured data, including natural language documentation and code text.

Table 3 Bug Prediction (TF-IDF + Random Forest)

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer

    commit_logs = [
        "Fixed null pointer exception in user login",
        "Refactored main activity",
        "Added new features to payment module",
    ]
    labels = [1, 0, 1]  # 1 = Bug-prone, 0 = Safe

    # Turn commit messages into TF-IDF feature vectors
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(commit_logs)

    model = RandomForestClassifier()
    model.fit(X, labels)

    # Reuse the fitted vocabulary for the new commit message
    new_commit = vectorizer.transform(["Removed redundant code"])
    print("Bug-prone" if model.predict(new_commit)[0] else "Safe")

4.3. Failure Forecasting: Failure forecasting uses temporal artificial intelligence models to predict system-level breakdowns beyond code-level problems.
Techniques like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks analyze time-series data from system logs, performance measures, and error traces. By allowing real-time monitoring that forecasts production faults before they occur, these models support proactive maintenance and greater software dependability.

Table 4 Failure Forecasting (LSTM)

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    import numpy as np

    # Random placeholder data: 100 sequences, 10 timesteps, 3 features
    X = np.random.rand(100, 10, 3)
    y = np.random.randint(0, 2, size=(100, 1))  # 0 = No Failure, 1 = Failure

    model = Sequential([
        LSTM(32, input_shape=(10, 3)),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=10, verbose=0)

    # Score one new log sequence against a 0.5 threshold
    sample_log = np.random.rand(1, 10, 3)
    print("Failure Risk" if model.predict(sample_log)[0][0] > 0.5 else "Stable")

5. Datasets and Tools

The quality and diversity of the datasets used during model training and evaluation greatly affect the effectiveness and generalizability of AI-powered defect prediction systems. The literature has drawn on a wide range of both private and public datasets for code smell detection, bug prediction, and failure prediction. Among the most frequently used are the PROMISE repository datasets, which include historical software metrics and defect labels for open-source projects. Other well-known sources are GitHub-based datasets, mined using tools like PyDriller or GHTorrent to extract real-world code metrics, commit history, and issue tracking data, and NASA's MDP datasets, which include module-level defect data from spacecraft software systems.
These datasets frequently contain many software engineering metrics such as Halstead complexity metrics, Chidamber and Kemerer (CK) object-oriented metrics (e.g., WMC, DIT, NOC), and process metrics (e.g., number of commits, churn rate). More recently, natural language features extracted from commit messages, pull request conversations, and issue descriptions have also been incorporated into prediction models, allowing a fuller, context-aware representation of software changes. To address data imbalance and label scarcity, some researchers have gone a step further and created synthetic datasets using mutation testing or simulation environments.

Many machine learning frameworks and software quality tools are regularly used to process and examine these datasets. Long a mainstay of defect prediction studies, WEKA offers a simple interface for conventional classifiers including Naive Bayes, SVMs, and Decision Trees. For deep learning-based approaches, researchers mostly use TensorFlow, Keras, or PyTorch, which provide flexibility in designing neural network architectures tailored to code representations such as token sequences, abstract syntax trees (ASTs), or vector embeddings from models like Code2Vec and CodeBERT. Moreover, tools like SonarQube and PMD are frequently used to detect code smells in Java and other languages through static code analysis, providing a consistent source of smell-related features (27–36).

Various problems remain even with this profusion of tools and datasets. Many datasets concentrate mostly on Java, C, and C++ projects and are therefore limited in language diversity. Reproducibility and cross-project comparison are further seriously hampered by labeling discrepancies, changing program structures, and non-standard data formatting.
Moreover, industrial datasets, even if more representative of real-world systems, are often inaccessible because of confidentiality, which restricts the relevance of research results largely to open-source environments. Advancing the field of AI-driven software defect prediction depends on addressing these constraints through improved benchmark datasets, uniform labeling, and collaborative sharing programs (33).

6. Evaluation Metrics and Validation Approaches

Evaluation is an important factor in determining the dependability and effectiveness of AI-powered defect prediction models in practical software engineering environments. Many measures have been used in the literature to reflect different aspects of model performance, especially in the context of imbalanced datasets where faulty modules are often the minority class. Precision, recall, F1-score, accuracy, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC) are the most frequently used measures. Recall evaluates the model's capacity to identify all truly defect-prone modules, while precision gauges the proportion of modules predicted to be defect-prone that are actually defective. As the harmonic mean of the two, the F1-score strikes a balance between them, which is especially helpful when the costs of false positives and false negatives have to be weighed equally. AUC-ROC, in turn, is a strong indicator for imbalanced datasets since it assesses the general classification capacity of the model independent of class distribution.

The Matthews Correlation Coefficient (MCC) is another crucial but less often reported statistic. It takes true and false positives and negatives into account and is considered a fair measure even when class sizes differ greatly. MCC is especially important in software defect prediction, where the large skew between defective and clean modules can distort traditional measures like accuracy.
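The metric definitions above can be made concrete with a small calculation. The following is a minimal sketch using only the Python standard library; the function names and the 20-module sample are hypothetical and chosen purely for illustration, not taken from any reviewed study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = defective."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def defect_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and MCC; undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical imbalanced sample: 2 defective modules out of 20.
y_true = [1, 1] + [0] * 18

# A trivial model that never flags a defect.
trivial = defect_metrics(y_true, [0] * 20)

# A model that finds one defect, misses one, and raises one false alarm.
partial = defect_metrics(y_true, [1, 0, 1] + [0] * 17)

print(trivial)
print(partial)
```

In this sample both models reach an accuracy of 0.9, yet the trivial model has recall and MCC of 0 while the second model has an MCC of roughly 0.44, which illustrates why accuracy alone can mislead on skewed defect data. Where scikit-learn is available, `sklearn.metrics` provides equivalent functions such as `matthews_corrcoef` and `f1_score`.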
To handle related problems, several studies additionally apply G-mean, Balanced Accuracy, or Kappa statistics.

To verify model performance and support generalizability, researchers often use k-fold cross-validation, in which the dataset is split into k subsets and the model is trained and tested k times, each time using a different subset as the test set and the remaining subsets as training data. This approach offers a more stable assessment of model performance and lowers the variance associated with random train-test splits. Hold-out validation (e.g., 70-30 or 80-20 splits) is faster and is employed in preliminary investigations but, depending on the split ratio and data distribution, is more prone to bias. When datasets are small or very skewed, stratified k-fold cross-validation and leave-one-out cross-validation (LOOCV) are also applied. Stratification helps each fold preserve the ratio of defective to non-defective cases, so that minority classes are fairly represented during training and testing. Some studies additionally include time-aware validation, particularly in failure forecasting, to replicate real-world deployment situations in which training takes place on past data and testing on future modules.

Comparative baselines are another necessary element of validation. To show improvement, AI-based models are often tested against conventional statistical techniques such as logistic regression or decision trees. Comparisons between families of artificial intelligence models, such as traditional machine learning (e.g., SVM, Random Forest) and deep learning (e.g., CNN, LSTM), are also reported to benchmark recent developments. Furthermore, ablation experiments are carried out to investigate the effects of particular preprocessing steps or individual features, informing model design for improved defect prediction.
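The stratified k-fold procedure described in this section can be sketched in a few lines. This is an illustrative, standard-library-only version under simplifying assumptions: the function name and the 25-module label list are hypothetical, and samples are dealt to folds in their original order without shuffling, which a real experiment would normally add.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Yield (train_idx, test_idx) pairs whose test folds keep the class ratio."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class round-robin so every fold gets a near-equal share of it.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for n, fold in enumerate(folds) if n != i for j in fold)
        yield train_idx, test_idx

# Hypothetical dataset: 5 defective modules (1) and 20 clean modules (0).
labels = [1] * 5 + [0] * 20
splits = list(stratified_k_fold(labels, k=5))

for train_idx, test_idx in splits:
    defective = sum(labels[i] for i in test_idx)
    print(f"test size={len(test_idx)}, defective in test={defective}")
```

With this sample every test fold holds one defective and four clean modules, matching the overall 1:4 ratio. In practice, scikit-learn's `StratifiedKFold` offers the same idea with shuffling and seeding options.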
Overall, building trust in AI-based defect prediction technologies depends on thorough evaluation using comprehensive metrics and validation techniques. Such evaluation supports not just academic rigor but also practical relevance in real pipelines for quality assurance and software maintenance (35, 37–39).

7. Key Findings and Trends

Across the field of software engineering, the systematic study of AI-based defect prediction methods reveals various emerging trends and findings. One notable result is the growing effectiveness of deep learning methods, particularly when applied to rich representations of source code such as abstract syntax trees (ASTs), graph-based structures, or embedded code sequences. Models such as Graph Neural Networks (GNNs) and Long Short-Term Memory (LSTM) networks have shown better performance in capturing both syntactic and semantic patterns associated with flaws. When paired with well-engineered static code metrics or historical commit data, simpler machine learning models such as Random Forests, Support Vector Machines (SVM), and Gradient Boosting Trees still perform competitively.

There is a clear trend from standalone code smell detection toward more integrated approaches linking smells with historical bug data, code churn, and change frequency. Studies reveal that although not all code smells cause problems, some high-severity smells, such as God Class and Feature Envy, are statistically more strongly associated with fault-prone components. More robust prediction results usually come from hybrid approaches combining structural code smells with process metrics (such as the number of changes, developer activity, or issue tracker logs).

Another important finding is the significance of feature quality over mere model complexity.
Many high-performing studies underline the importance of domain-specific features extracted from source code, such as natural language features derived from comments and commit messages or object-oriented metrics (e.g., coupling, cohesion, inheritance depth). These features generally outperform generic ones, and their careful selection greatly increases prediction accuracy. Indeed, especially on smaller datasets where overfitting is still a problem, feature engineering is often found to be the decisive factor in the success of a model.

Furthermore, especially in industrial settings, the review notes a growing need for model interpretability and explainability. If the predictions are opaque, developers are often reluctant to use artificial intelligence methods. Recent studies have thus begun to include visualizations, SHAP values, or attention mechanisms highlighting which code regions or features most affect the model's decision. This increases developer confidence and favors actionable insights over black-box alarms.

Finally, cross-project learning and transfer learning are clearly trending in defect prediction. Dataset bias and codebase-specific traits cause many models trained on a particular project to struggle to generalize. Some studies investigate domain adaptation methods and multi-task learning systems using data from several projects, thereby enhancing generalizability. These methods are especially helpful in real-world situations where labeled fault data is scarce for smaller or newer projects. Taken together, these results highlight the shift from conventional static analysis to AI-enhanced, context-aware, and developer-friendly defect prediction systems, advancing academic research as well as practical software quality assurance (34, 40–44).

8. Future Research Directions

Although defect prediction driven by artificial intelligence has made great strides, several research opportunities remain unexplored.
One of the most urgent directions is the development and integration of explainable artificial intelligence (XAI) approaches into defect prediction models. Most high-performing machine learning models, especially deep learning architectures, are black boxes by nature, which limits their acceptance in critical software development processes. Developers often need a reasonable justification for why a code segment is predicted to be defective or not. By offering human-understandable insights into how predictions are generated, XAI systems can improve trust, accountability, and decision-making, and so close the gap between model performance and developer usability.

Transfer learning and domain adaptation across software projects present another promising avenue. Usually developed on a particular dataset, most artificial intelligence models for defect prediction struggle to generalize when applied to other software repositories. Variations in code structure, development practices, and project-specific traits produce this limitation. Future research should look at how models trained on one project may be adapted or fine-tuned with minimal data from another. Particularly when combined with fine-tuning, pretrained language models for source code, such as CodeBERT or GraphCodeBERT, offer promising opportunities for cross-project learning.

Another important area of development is the integration of AI-based defect prediction systems into Continuous Integration and Continuous Deployment (CI/CD) pipelines. Most current research treats prediction as a stand-alone process; incorporating these predictions into real-time software development environments could greatly improve proactive debugging and quality assurance processes.
Future studies could concentrate on creating lightweight, real-time prediction algorithms that automatically evaluate commits or pull requests and notify developers about possible risks without appreciably raising the overhead of the pipeline.

Future research could also investigate multimodal defect prediction by aggregating several kinds of input data: source code, commit messages, developer comments, execution logs, and even test case results. Such rich and varied data could increase the context-awareness and resilience of defect prediction models. Combining natural language processing with static code analysis, for instance, might find semantic flaws or misaligned developer intent more successfully than either approach used alone.

Finally, human-in-the-loop systems are emerging as a hybrid method in which artificial intelligence supports rather than replaces human judgment. These systems give developers feedback, let them interact with predictions, and let them refine models over time. Such configurations might provide a continuous learning loop between the artificial intelligence system and its users, help to reduce false positives, and enable adaptation to changing codebases. Building feedback mechanisms into predictive tools helps them evolve in line with the software projects they support, producing more accurate and sustainable defect forecasting systems.

Conclusions

Rising to meet the increasing demand for scalable, accurate, proactive quality assurance methods, AI-powered defect prediction has become a transformative tool in modern software engineering. Synthesizing 500 peer-reviewed publications spanning more than a decade, this systematic analysis highlights the development of approaches and the growing complexity of predictive models, from traditional algorithms to deep learning and transformer-based architectures.
While classic models like Random Forest and SVM remain extensively used due to their robustness and interpretability, more recent deep learning models, including CNNs, RNNs, and transformers, offer significant advantages, particularly in handling unstructured data such as commit messages, source code comments, and system logs.

The value of feature selection and representation is a recurring theme throughout the review. Rich, domain-specific features, including code smells, object-oriented metrics, change history, and semantic information, often matter more to model performance than the complexity of the learning method itself. Research increasingly supports hybrid feature sets that combine textual data, process metrics, and static code analysis, suggesting that a multimodal approach can more precisely reflect the multifaceted nature of software faults.

Furthermore, there is a clear trend toward integrated and longitudinal models, which link early-stage indicators (such as code smells) with mid- and late-stage outcomes (such as bugs and runtime failures). This lifecycle-aware approach improves the capacity of AI systems to forecast not only single defects but also possible cascades of failures, facilitating more strategic allocation of testing and maintenance effort.

Still, numerous difficulties remain despite these advances. Dataset bias and differences in coding standards still cause many models to generalize poorly across projects. Furthermore, the opacity of complex models, especially deep learning architectures, raises concerns about interpretability, which is essential for developer confidence and adoption in practice. The lack of high-quality, consistent, and varied datasets also compromises reproducibility and broad applicability, especially in industrial settings where data is often proprietary.

Declarations

Author Contributions
All authors reviewed and approved the final version of this manuscript.
Ethics Statement
No ethical issues declared.

Conflict of Interest
The authors declare no conflicts of interest relevant to this systematic analysis.

Funding
No external funding was used to support the development of this manuscript.

Data Availability Declaration
The data supporting the findings of this study are derived from publicly available datasets commonly used in software defect prediction research, including but not limited to the PROMISE repository, NASA MDP datasets, and GitHub-based repositories obtained using tools such as PyDriller and GHTorrent. Detailed information regarding dataset names, domains, and sources is included within the manuscript. No new data were created, and no proprietary data were used, for this study. Therefore, data sharing is not applicable to this article. All data utilized in the review are available from the cited references.

References

1. Agrawal A, Menzies T (2018) Is “better data” better than “better data miners”?: on the benefits of tuning SMOTE for defect prediction. In: Proceedings of the 40th international conference on software engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pp 1050–1061.
2. Ahluwalia A, Falessi D, Di Penta M (2019) Snoring: a noise in defect prediction datasets. In: Storey MD, Adams B, Haiduc S (eds) Proceedings of the 16th international conference on mining software repositories, MSR 2019, 26-27 May 2019. IEEE / ACM, Canada, pp 63–67. https://doi.org/10.1109/MSR.2019.00019.
3. Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: Bias in bug-fix datasets. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE ’09. ACM, New York, pp 121–130. https://doi.org/10.1145/1595696.1595716.
4. Chen J, Shang W, Shihab E (2020) PerfJIT: Test-level just-in-time prediction for performance regression introducing commits. IEEE Trans Softw Eng :1–1. https://doi.org/10.1109/TSE.2020.3023955.
5. Chen TH, Nagappan M, Shihab E, Hassan AE (2014) An empirical study of dormant bugs. In: Proceedings of the 11th working conference on mining software repositories, MSR 2014. https://doi.org/10.1145/2597073.2597108.
6. Chi J, Honda K, Washizaki H, Fukazawa Y, Munakata K, Morita S, Uehara T, Yamamoto R (2017) Defect analysis and prediction by applying the multistage software reliability growth model. In: IWESEP. IEEE Computer Society, pp 7–11.
7. Cogo FR, Oliva GA, Hassan AE (2019) An empirical study of dependency downgrades in the NPM ecosystem. IEEE Trans Softw Eng :1–1. https://doi.org/10.1109/TSE.2019.2952130.
8. Dalla Palma S, Di Nucci D, Palomba F, Tamburri DA (2021) Within-project defect prediction of infrastructure-as-code using product and process metrics. IEEE Trans Softw Eng :1–1. https://doi.org/10.1109/TSE.2021.3051492.
9. Demiröz G, Güvenir HA (1997) Classification by voting feature intervals. In: ECML, Springer, Lecture Notes in Computer Science, vol 1224, pp 85–92.
10. Fan Y, Xia X, Alencar da Costa D, Lo D, Hassan AE, Li S (2019) The impact of changes mislabeled by SZZ on just-in-time defect prediction. IEEE Trans Softw Eng :1–1. https://doi.org/10.1109/TSE.2019.2929761.
11. Fukushima T, Kamei Y, McIntosh S, Yamashita K, Ubayashi N (2014) An empirical study of just-in-time defect prediction using cross-project models. In: Proceedings of the 11th working conference on mining software repositories, pp 172–181.
12. Ghotra B, McIntosh S, Hassan AE (2017) A large-scale study of the impact of feature selection techniques on defect classification models. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR). IEEE, pp 146–157.
13. Giger E, D’Ambros M, Pinzger M, Gall H (2012) Method-level bug prediction. pp 171–180. https://doi.org/10.1145/2372251.2372285.
14. Herbold S (2019) On the costs and profit of software defect prediction. arXiv:1911.04309.
15. Kamei Y, Shihab E (2016) Defect prediction: Accomplishments and future challenges. In: 2016 IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER), vol 5. IEEE, pp 33–45.
16. Mende T, Koschke R (2010) Effort-aware defect prediction models. In: Capilla R, Ferenc R, Dueñas JC (eds) 14th European conference on software maintenance and reengineering, CSMR 2010, 15-18 March 2010. IEEE Computer Society, Spain, pp 107–116. https://doi.org/10.1109/CSMR.2010.18.
17. Amasaki S. Cross-version defect prediction: use historical data, cross-project data, or both? Empir Softw Eng. 2020;25.
18. Bangash AA, Sahar H, Hindle A, Ali K. On the time-based conclusion stability of cross-project defect prediction models. Empir Softw Eng. 2020;25.
19. Bennin KE, Keung J, Phannachitta P, Monden A, Mensah S. MAHAKIL: diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Trans Software Eng. 2018;44.
20. Bennin KE, Keung JW, Monden A. On the relative value of data resampling approaches for software defect prediction. Empir Softw Eng. 2019;24.
21. Falessi D, Ahluwalia A, Penta MD. The impact of dormant defects on defect prediction: A study of 19 apache projects. ACM Trans Softw Eng Methodol (TOSEM). 2021;31.
22. Zhao Y, Damevski K, Chen H. A systematic survey of just-in-time software defect prediction. ACM Computing Surveys. 2023;55(10):1-35.
23. Pachouly J, Ahirrao S, Kotecha K, Selvachandran G, Abraham A. A systematic literature review on software defect prediction using artificial intelligence: Datasets, Data Validation Methods, Approaches, and Tools. Engineering Applications of Artificial Intelligence. 2022;111:104773.
24. Li Z, Jing X-Y, Zhu X. Progress on approaches to software defect prediction. IET Software. 2018;12(3):161-75.
25. Li N, Shepperd M, Guo Y. A systematic review of unsupervised learning techniques for software defect prediction. Information and Software Technology. 2020;122:106287.
26. Batool I, Khan TA. Software fault prediction using data mining, machine learning and deep learning techniques: A systematic literature review. Computers and Electrical Engineering. 2022;100:107886.
27. Zhao L, Shang Z, Zhao L, Zhang T, Tang YY. Software defect prediction via cost-sensitive Siamese parallel fully-connected neural networks. Neurocomputing. 2019;352.
28. Xia X, Lo D, Pan SJ, Nagappan N, Wang X. Hydra: massively compositional model for cross-project defect prediction. IEEE Trans Softw Eng. 2016;42.
29. Tong H, Liu B, Wang S. Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning. Inf Softw Technol. 2018;96.
30. Liang H, Yu Y, Jiang L, Xie Z. Seml: a semantic LSTM model for software defect prediction. IEEE Access. 2019;7.
31. Miholca DL, Czibula G, Czibula IG. A novel approach for software defect prediction through hybridizing gradual relational association rules with artificial neural networks. Inf Sci. 2018;441.
32. Manjula C, Florence L. Deep neural network based hybrid approach for software defect prediction using software metrics. Clust Comput. 2019;22.
33. Jayanthi R, Florence L. Software defect prediction techniques using metrics based on neural network classifier. Clust Comput. 2019;22.
34. Khan MZ. Hybrid ensemble learning technique for software defect prediction. Int J Modern Educ Comput Sci. 2020;12.
35. Majd A, Vahidi-Asl M, Khalilian A, Poorsarvi-Tehrani P, Haghighi H. SLDeep: statement-level software defect prediction using deep-learning model on static code features. Expert Syst Appl. 2020;147.
36. Akour M, Melhem WY. Software defect prediction using genetic programming and neural networks. Int J Open Sour Softw Process. 2017;8.
37. Ali MM, Huda S, Abawajy J. A parallel framework for software defect detection and metric selection on cloud computing. Clust Comput. 2017;20.
38. Khleel NAA, Nehéz K. A novel approach for software defect prediction using CNN and GRU based on SMOTE Tomek method. J Intell Inf Syst. 2023;60.
39. Khleel NAA, Nehéz K. A new approach to software defect prediction based on convolutional neural network and bidirectional long short-term memory. Prod Syst Inf Eng. 2022;10.
40. Mohammed B, Awan I, Ugail H. Failure prediction using machine learning in a virtualised HPC system and application. Clust Comput. 2019;22.
41. Anbu M, Anandha Mala GS. Feature selection using firefly algorithm in software defect prediction. Clust Comput. 2019;22.
42. Pan C, Lu M, Xu B. An improved CNN model for within-project software defect prediction. Appl Sci. 2019;9.
43. Chen L, Fang B, Shang Z, Tang Y. Negative samples reduction in cross-company software defects prediction. Inf Softw Technol. 2015;62.
44. Feng S, Keung J, Yu X, Xiao Y, Zhang M. Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction. Inf Softw Technol. 2021;139.

Additional Declarations
No competing interests reported.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6792823","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":467894516,"identity":"ba6442ba-de7d-46e7-9b9c-7847b68a9770","order_by":0,"name":"Md Mostafizur Rahman","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAArUlEQVRIiWNgGAWjYPACGyBmbDxAipY0kJYGkrQcBpPEaeHvP3x0c2XOebu17YeBttTYRBPUInEjLe3m2W23k7edSQRqOZaW20BIi4EEj9nNRqAWswNALYwNh4nQwn/+G1DLuWSz8w+J1cKQwwbUcsDO7AaxtgD9AnJYcoLZDaAtCcT4BRhiz4Ba7OzNzqc/fPChxoawFhhIBKtMIFY5CNiTongUjIJRMApGGAAA+MlKOR1C1n0AAAAASUVORK5CYII=","orcid":"","institution":"Westcliff University","correspondingAuthor":true,"prefix":"","firstName":"Md","middleName":"Mostafizur","lastName":"Rahman","suffix":""},{"id":467894517,"identity":"7df2e029-46c0-4b51-a695-5c2ddc6faef3","order_by":1,"name":"Md Mostafijur Rahman","email":"","orcid":"","institution":"Rajshahi University of Engineering \u0026 Technology(RUET)","correspondingAuthor":false,"prefix":"","firstName":"Md","middleName":"Mostafijur","lastName":"Rahman","suffix":""},{"id":467894518,"identity":"56ddd82d-d139-4ef3-9a35-d12731cd83fc","order_by":2,"name":"Maria Khatun Shuvra","email":"","orcid":"","institution":"Grand Canyon University","correspondingAuthor":false,"prefix":"","firstName":"Maria","middleName":"Khatun","lastName":"Shuvra","suffix":""},{"id":467894520,"identity":"de439f40-a379-4856-a92f-c1c311cf0b31","order_by":3,"name":"Md Mashfiquer Rahman","email":"","orcid":"","institution":"Louisiana State 
University","correspondingAuthor":false,"prefix":"","firstName":"Md","middleName":"Mashfiquer","lastName":"Rahman","suffix":""},{"id":467894522,"identity":"a3167c28-f0d1-4bc7-b48b-00fead31db89","order_by":4,"name":"Najmul Gony Md","email":"","orcid":"","institution":"Grand Canyon University","correspondingAuthor":false,"prefix":"","firstName":"Najmul","middleName":"","lastName":"Gony","suffix":"Md"}],"badges":[],"createdAt":"2025-05-31 20:23:11","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6792823/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6792823/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":84182281,"identity":"d6ab74d3-aed2-4a2d-9951-b04955894865","added_by":"auto","created_at":"2025-06-09 04:14:31","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":53744,"visible":true,"origin":"","legend":"\u003cp\u003eProportion Of Feature Types Used In Defect Prediction\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-6792823/v1/58377d9392d31b44a6541e00.png"},{"id":84182287,"identity":"9ef2d315-0109-404d-9067-772de4d3f3b9","added_by":"auto","created_at":"2025-06-09 04:14:31","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":48028,"visible":true,"origin":"","legend":"\u003cp\u003eEvaluation Metrics Frequency In Reviewed Studies\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-6792823/v1/7a00875b1fc8bd68a43e7ff1.png"},{"id":84182282,"identity":"c06f6e74-2681-4964-a1db-c1ceb8e73f94","added_by":"auto","created_at":"2025-06-09 04:14:31","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":361335,"visible":true,"origin":"","legend":"\u003cp\u003eDatasets and 
tools\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6792823/v1/e7d8ca953752c9e2d6acdb87.png"},{"id":84182291,"identity":"f16be046-5e87-4779-8571-735c81a8babb","added_by":"auto","created_at":"2025-06-09 04:14:32","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":34069,"visible":true,"origin":"","legend":"\u003cp\u003eProportion Of Evaluation Metrics Handling Class Imbalance\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-6792823/v1/883b49ce7c6a46fcf03762e9.png"},{"id":84183264,"identity":"5798e5be-ed51-45f6-a2c5-d9204775db0b","added_by":"auto","created_at":"2025-06-09 04:38:37","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1074222,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6792823/v1/8edf93d6-4150-4a9b-bb7c-5847e49a65d9.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"AI-Powered Defect Prediction: From Code Smells to Failure Forecasting","fulltext":[{"header":"1. INTRODUCTION","content":"\u003cp\u003eUsually beyond the capacity and resources of most development teams are the ways in which flaws are introduced into code and the sheer number of flaws in software. Defect prediction methods seek to find software artifacts most likely to be faulty (1). By giving developers' attention on particular artifacts such as commits, methods, or classes top priority, defect prediction mostly helps to lower the cost of testing, analysis, and code reviews (2).\u003c/p\u003e \u003cp\u003eSince it directly helps to improve software quality, lower maintenance costs, and speed development cycles, defect prediction is rather important in software engineering (3). 
Identifying components or modules likely to have flaws before they are used helps developers allocate testing and debugging resources more wisely, therefore reducing possible failures that can cause system collapses, security vulnerabilities, or performance problems (4). Manual code inspection becomes unworkable in large-scale, sophisticated software systems, and undetectable flaws can cause significant financial and reputation harm. Early and automatic detection of high-risk areas based on historical data, code metrics, or even developer behavior is made possible by defect prediction models especially those driven by artificial intelligence and machine learning (5). This proactive technique turns the emphasis from reactive debugging to preventive quality assurance, hence producing more dependable, maintainable, and secure software products. Software keeps underlining important infrastructure and services; so, the capacity to forecast and prevent flaws is not only helpful but also necessary to provide strong, high-performance systems (6).\u003c/p\u003e \u003cp\u003eWithin the field of software engineering, a number of important words are crucial for grasp of defect prediction (7). Code smells are structural elements in the source code that, although not always bad or flawed, show underlying design or implementation problems that might cause flaws over time. Examples that undermine code readability and maintainability are too long approaches, duplicated code, or classes handling several roles. Conversely, bugs errors or weaknesses in the software that lead to unexpected or improper behavior often result from logical errors, erroneous presumptions, or inadvertent component interactions (8). Usually resulting from underlying defects, failures in a software system are little annoyances or major system crashes that compromise the intended operation of the system during operation. 
Technical debt is the total cost of poor design decisions and hasty fixes that give short-term aims top priority over long-term code quality, therefore impeding future progress and complicating defect fixing. These ideas taken together provide the basis of defect prediction since they underline the several causes and expressions of software quality degradation (9).\u003c/p\u003e \u003cdiv id=\"Sec2\" class=\"Section2\"\u003e \u003ch2\u003e1.1. Rise of AI/ML in automating defect prediction:\u003c/h2\u003e \u003cp\u003eWith automated, data-driven solutions to a typically human and error-prone process, the rise of artificial intelligence (AI) and machine learning (ML) has fundamentally changed the terrain of defect prediction in software engineering. Conventional techniques of flaw detection such as static code analysis and human code reviews strugg to keep pace with the volume and complexity of modern codebases as software systems get more sophisticated and large (10\u0026ndash;13). By learning from past software data including code metrics, modification histories, and bug reports, AI/ML methods solve these problems by identifying patterns and signs linked with flaws. Models including decision trees, support vector machines, neural networks, and more lately, deep learning architectures, have shown amazing capacity in very accurate prediction of problematic components. These instruments improve early fault discovery as well as offer scalable, ongoing evaluation all through the software development process (14\u0026ndash;16). Moreover, developments in graph-based models and natural language processing have made it possible to analyze unstructured data, including dependency graphs and commit messages, thereby enhancing fault detection capacities. 
Consequently, using intelligent automation, artificial intelligence and machine learning has become a vital friend in reaching better software quality and dependability (17\u0026ndash;21).\u003c/p\u003e \u003cp\u003eWith an eye on bridging early-stage signs like code smells to end-stage manifestations like system failures, the scope of this systematic review covers the whole spectrum of artificial intelligence-powered defect prediction. This review stresses their interconnection and evolution during the program lifetime, even while most of the current research views code abnormalities, defects, and failures as isolated events. We want to give a whole picture of defect development by looking at how early warning flags in the codebase such as bad design patterns or maintainability issues can escalate into functional flaws and finally operational breakdowns. From development to deployment, this integrated approach helps to find predictive elements at several phases and assesses how artificial intelligence and machine learning methods might be applied to foresee problems at each level. 
By doing this, the study not only emphasizes the requirement of thorough, multilayered methods that solve flaws before them impact system dependability or performance but also the efficiency of individual prediction models.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eResearch Questions For this review\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eResearch questions include:\u003c/p\u003e \u003cp\u003e1) For defect prediction, which artificial intelligence methods apply?\u003c/p\u003e \u003cp\u003e2) How could software flaws relate to code smells?\u003c/p\u003e \u003cp\u003e3) Which common datasets, measurements, and tools exist?\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"2. Study Design","content":"\u003cp\u003eFollowing the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) criteria, this systematic review guarantees scientific rigor, openness, and repeatability. Including early-stage indicators (e.g., code smells), mid-stage anomalies (e.g., bugs), and late-stage operational difficulties (e.g., system failures), the main goal was to fully assess and synthesize AI-driven solutions for defect prediction across the software lifecycle.\u003c/p\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.1. 
Search Strategy:\u003c/h2\u003e \u003cp\u003eTo guarantee thorough covering of peer-reviewed literature and high-quality research outputs, a multi-database search was carried out. Covering the years January 2013 through April 2025, database searches included IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, Scopus, and Web of Science. Carefully created using Boolean operators, search queries contained combinations of the following keywords:\u003cem\u003e\"Artificial Intelligence,\" OR \"AI,\" \"Machine Learning,\" OR \"Deep Learning,\" AND \"Failure Prediction,\" OR \"Bug Forecasting,\" OR \"Defect Prediction AND \"Code Smells\" OR \"Software Quality\" OR \"Technical Debt.\"\u003c/em\u003e\u003c/p\u003e \u003cp\u003eFor backward snowballing, we also looked over the references of a few chosen papers; for forward snowballing, we employed citation tracking on Google Scholars.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStudy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCovered Areas\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eZhao et al.\u003c/b\u003e\u003c/p\u003e \u003cp\u003e(22)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2000\u0026ndash;2021 (67 Studies)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePachouly et al. 
(23)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2010\u0026ndash;2021 (146 studies)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLi et al. (24)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2014\u0026ndash;2017 (64 studies)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHosseini et al. (25)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026ndash;2015 (68 studies)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLi et al. (25)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2010\u0026ndash;2018 (75 studies)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBatool and Khan (26)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2010\u0026ndash;2021 (80 studies)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.2. 
Exclusion and Inclusion Standards:\u003c/h2\u003e \u003cp\u003eArticles were chosen according on the following standards in order to guarantee relevance and quality:\u003c/p\u003e \u003cp\u003eEmpirical investigations using artificial intelligence/machine learning to predict software flaws Papers devoid of artificial intelligence or machine learning approaches\u003c/p\u003e \u003cp\u003ePublications covering 2013 to 2025 Non-peer-reviewed materials (like blogs or white papers)\u003c/p\u003e \u003cp\u003eEnglish-language magazines just Studies providing evaluation outcomes using publicly available or industry datasets in languages other than English Review books, position papers, or studies lacking empirical support.\u003c/p\u003e \u003cp\u003eCoverings of at least one stage: codes smell, bugs, vulnerabilities, or failure events, studies concentrating exclusively on hardware or non-software-related defect identification.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.3. Screening and Selection Procedure:\u003c/h2\u003e \u003cp\u003eAfter duplicate elimination, 3,418 papers were first accessed. Two reviewers separately conducted title and abstract screening using Rayyan QCRI; next was full-text screening. A third reviewer fixed conflicts. Following a thorough eligibility check, the final review comprised 500 items in all. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows the PRISMA flow diagram depicting this process; it will be included in the book.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.4. Data Gathering:\u003c/h2\u003e \u003cp\u003eDevelopment of a structured data extraction form guarantees consistency among reviewers. 
Data for every chosen article were gathered: Title, authors, year, venue\u0026mdash;bibliographic information, Artificial Intelligence/ML Techniques Algorithm type, model complexity, training technique used\u003c/p\u003e \u003cp\u003ePrediction Scope: Code smells, bugs, security flaws, running-through problems. Static measurements, dynamic execution logs, textual data, change history in feature types. Name of the dataset, source public or private, size, domain. Accuracy, precision, recall, F1-score, AUC-ROC, MCC: evaluation metrics K-fold CV, train/test split, cross-project validation formative strategy Weka, Scikit-learn, TensorFlow, Code2Vec, SonarQube, etc. tools and frameworks\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.5. Evaluation of Quality:\u003c/h2\u003e \u003cp\u003eEvery study was assessed using a modified form of the Critical Appraisal Skills Program (CASP) checklist developed for software engineering in order to evaluate methodological quality and reduce bias. Among the criteria were experimental rigor, dataset availability, repeatability, and algorithm transparency. Experiments were scored as High, Medium, or Low quality; sensitivity experiments were carried out to evaluate how lower-quality research would affect overall results.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e2.6. Method of Data Synthesis:\u003c/h2\u003e \u003cp\u003eA narrative synthesis was used considering the variances in studies in terms of approaches, datasets, and results. Research was arranged thematically by model type traditional ML against deep learning and by prediction focus e.g., code smell detection, bug prediction, failure forecasting. Variations in experimental design and incompatible evaluation criteria prevented quantitative synthesis (meta-analysis). 
Descriptive statistics and qualitative coding helped to spot trends, popular methods, and research holes.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Results","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.1. Defect Prediction: AI Models:\u003c/h2\u003e \u003cp\u003eThe review turned up a wide spectrum of artificial intelligence and machine learning models used in the literature. With 130 investigations, Random Forest which boasts interpretability, robustness against overfitting, and simplicity of use emerged as the most often used model. Not far behind with 85 and 110 studies respectively are Support Vector Machines (SVM) and Neural Networks. There were sixty articles on traditional decision trees, mostly from past editions. Often used in failure prediction and log data analysis, deep learning models like CNNs and RNNs showed up in 45 and 30 research respectively. Though relatively new in this sector, transformers were utilized in 40 research showing encouraging results in natural language-based fault prediction from code comments and commit messages. The bar chart below shows this distribution, stressing in recent years a move toward more intricate models.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.2. Defect Prediction Features Type:\u003c/h2\u003e \u003cp\u003eMost research heavily rely on feature engineering. Applied in 220 papers, the most often employed features were Static Metrics including cyclomatic complexity, coupling, cohesiveness, and lines of code. 160 papers highlighted process metrics like code churn, commit frequency, and developer experience. Found in 100 investigations were textual features obtained from commit messages, bug reports, and inline comments. Just 50 studies mostly for failure prediction utilizing temporal deep learning models used Log Data. 
Notably, 120 studies used combined feature sets in which textual, dynamic, and static data were merged, reflecting a trend toward multimodal learning. The following chart shows this distribution and suggests that richer feature sets could yield better prediction accuracy.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e\u003cb\u003e3.3.\u003c/b\u003e Validation Methodologies and Evaluation Metrics:\u003c/h2\u003e \u003cp\u003eThe classification performance of defect prediction models was assessed using several metrics. Accuracy was the most frequently reported metric, used in 480 of the 500 studies. F1-score and precision were also common, reported in 410 and 370 studies respectively, because they help to address class imbalance. Recall was emphasized in 360 studies, particularly in safety-critical applications where identifying all possible defects is vital. AUC-ROC and the Matthews Correlation Coefficient (MCC) appeared in 290 and 120 studies respectively, with MCC gaining popularity for its balanced assessment of binary classification tasks. Most studies used k-fold cross-validation or train/test splits, with a minority using time-series validation for log-based forecasting; these metrics were sometimes used in combination. The chart below summarizes metric use and underscores the community's reliance on standard classification measures.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. AI-Based Defect Prediction's Taxonomy","content":"\u003cp\u003eFrom early code anomalies to system breakdowns, AI-powered defect prediction spans several phases of the software development lifecycle. The taxonomy presented here has three basic categories: Code Smell Detection, Bug and Vulnerability Prediction, and Failure Forecasting. 
As seen in the chart below, each category uses different artificial intelligence techniques suited to its detection objectives.\u003c/p\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e4.1. \u003cb\u003eCode Smell Detection\u003c/b\u003e:\u003c/h2\u003e \u003cp\u003eCode smells are signs of poor design decisions that do not always result in bugs but create maintainability problems and can eventually lead to software defects. Typical examples are feature envy, god classes, and long methods. Artificial intelligence models including Transformers, Convolutional Neural Networks (CNNs), and Decision Trees have been used to identify these smells from static code metrics and source patterns. Both static analysis (source code scanning) and dynamic analysis (runtime behavior inspection) are applied, though static approaches are more common because of their scalability.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eCode Smell Detection (Decision Tree)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003efrom sklearn.tree import DecisionTreeClassifier\u003c/p\u003e \u003cp\u003eX = [[45, 10, 3], [10, 2, 1], [100, 20, 5]] # Features: length, complexity, loops\u003c/p\u003e \u003cp\u003ey = [1, 0, 1] # 1 = Smell, 0 = Clean\u003c/p\u003e \u003cp\u003emodel = DecisionTreeClassifier()\u003c/p\u003e \u003cp\u003emodel.fit(X, y)\u003c/p\u003e \u003cp\u003enew_method = [[60, 12, 2]]\u003c/p\u003e 
\u003cp\u003eprediction = model.predict(new_method)\u003c/p\u003e \u003cp\u003eprint(\"Smelly\" if prediction[0] else \"Clean\")\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e\u003cb\u003e4.2. Bug and Vulnerability Prediction\u003c/b\u003e:\u003c/h2\u003e \u003cp\u003eThis category focuses on identifying potentially defective code sections before they cause runtime problems. Commonly used artificial intelligence models include Extreme Gradient Boosting (XGBoost), Random Forests, and Support Vector Machines (SVM). These models are trained on features taken from historical software repositories, including commit messages, change frequency, and developer behavior. Deep learning models are increasingly used because they can learn from unstructured data such as natural language documentation and code text.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBug Prediction (TF-IDF + Random Forest)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003efrom sklearn.ensemble import RandomForestClassifier\u003c/p\u003e \u003cp\u003efrom sklearn.feature_extraction.text import TfidfVectorizer\u003c/p\u003e \u003cp\u003ecommit_logs = [\u003c/p\u003e \u003cp\u003e \"Fixed null pointer exception in user login\",\u003c/p\u003e \u003cp\u003e \"Refactored main activity\",\u003c/p\u003e \u003cp\u003e \"Added new features to 
payment module\"\u003c/p\u003e \u003cp\u003e]\u003c/p\u003e \u003cp\u003elabels = [1, 0, 1] # 1 = Bug-prone, 0 = Safe\u003c/p\u003e \u003cp\u003evectorizer = TfidfVectorizer()\u003c/p\u003e \u003cp\u003eX = vectorizer.fit_transform(commit_logs)\u003c/p\u003e \u003cp\u003emodel = RandomForestClassifier()\u003c/p\u003e \u003cp\u003emodel.fit(X, labels)\u003c/p\u003e \u003cp\u003enew_commit = vectorizer.transform([\"Removed redundant code\"])\u003c/p\u003e \u003cp\u003eprint(\"Bug-prone\" if model.predict(new_commit)[0] else \"Safe\")\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e\u003cb\u003e4.3.\u003c/b\u003e Failure Forecasting:\u003c/h2\u003e \u003cp\u003eFailure forecasting uses temporal artificial intelligence models to predict system-level breakdowns that go beyond code-level problems. 
Techniques such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks analyze time-series data from system logs, performance measures, and error traces. By enabling real-time monitoring that forecasts faults in production environments before they occur, these models support proactive maintenance and greater software dependability.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFailure Forecasting (LSTM)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eimport tensorflow as tf\u003c/p\u003e \u003cp\u003efrom tensorflow.keras.models import Sequential\u003c/p\u003e \u003cp\u003efrom tensorflow.keras.layers import LSTM, Dense\u003c/p\u003e \u003cp\u003eimport numpy as np\u003c/p\u003e \u003cp\u003eX = np.random.rand(100, 10, 3) # 100 sequences, 10 timesteps, 3 features\u003c/p\u003e \u003cp\u003ey = np.random.randint(0, 2, size=(100, 1)) # 0 = No Failure, 1 = Failure\u003c/p\u003e \u003cp\u003emodel = Sequential([\u003c/p\u003e \u003cp\u003e LSTM(32, input_shape=(10, 3)),\u003c/p\u003e \u003cp\u003e Dense(1, activation='sigmoid')\u003c/p\u003e \u003cp\u003e])\u003c/p\u003e \u003cp\u003emodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])\u003c/p\u003e \u003cp\u003emodel.fit(X, y, epochs=10, verbose=0)\u003c/p\u003e \u003cp\u003esample_log = np.random.rand(1, 10, 3)\u003c/p\u003e 
\u003cp\u003eprint(\"Failure Risk\" if model.predict(sample_log)[0][0] \u0026gt; 0.5 else \"Stable\")\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5. Datasets and Tools","content":"\u003cp\u003eThe quality and diversity of the datasets used for model training and evaluation strongly affect the effectiveness and generalizability of AI-powered defect prediction systems. The literature draws on a wide range of private and public datasets for code smell detection, bug prediction, and failure prediction. Among the most frequently used are the PROMISE repository datasets, which contain historical software metrics and defect labels for open-source projects. Other well-known sources are NASA's MDP datasets, which include module-level defect data from spacecraft software systems, and GitHub-based datasets mined with tools like PyDriller or GHTorrent to extract real-world code metrics, commit history, and issue tracking data. These datasets typically contain software engineering metrics such as Halstead complexity metrics, Chidamber and Kemerer (CK) object-oriented metrics (e.g., WMC, DIT, NOC), and process metrics (e.g., number of commits, churn rate). More recently, natural language features extracted from commit messages, pull request conversations, and issue descriptions have also been incorporated into prediction models, allowing a fuller, context-aware representation of software changes. To address data imbalance and label scarcity, some researchers have gone a step further and created synthetic datasets using mutation testing or simulation environments. A range of machine learning frameworks and software quality tools is regularly used to process and analyze these datasets. 
Long a mainstay of defect prediction studies, WEKA offers a simple interface for conventional classifiers including Naive Bayes, SVMs, and Decision Trees. For deep learning-based approaches, researchers mostly use TensorFlow, Keras, or PyTorch, which provide flexibility in designing neural network architectures tailored to code representations such as token sequences, abstract syntax trees (ASTs), or vector embeddings from models like Code2Vec and CodeBERT. Moreover, tools like SonarQube and PMD are frequently used to detect code smells in Java and other languages through static code analysis, providing a consistent source of smell-related features (27\u0026ndash;36).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eSeveral problems remain despite the abundance of tools and datasets. Many datasets concentrate on Java, C, and C++ projects, which limits language diversity. Reproducibility and cross-project comparison are further hampered by labeling inconsistencies, evolving program structures, and non-standard data formats. Moreover, industrial datasets, although more representative of real-world systems, are often inaccessible because of confidentiality, which restricts the applicability of research results largely to open-source environments. Advancing artificial intelligence-driven software defect prediction depends on addressing these constraints through improved benchmark datasets, uniform labeling, and collaborative sharing initiatives (33).\u003c/p\u003e"},{"header":"6. Evaluation Metrics and Validation Approaches","content":"\u003cp\u003eEvaluation is an important factor in determining the dependability and effectiveness of AI-powered defect prediction models in practical software engineering environments. 
Many metrics have been used in the literature to capture different aspects of model performance, especially for imbalanced datasets in which defective modules are typically the minority class. Precision, recall, F1-score, accuracy, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC) are the most frequently used. Precision measures the proportion of modules predicted as defect-prone that are actually defective, while recall measures the model's ability to identify all actual defect-prone modules. The F1-score is the harmonic mean of the two and is especially helpful when false positives and false negatives must be weighed equally. AUC-ROC, in contrast, summarizes the model's ability to rank defective modules above clean ones across all classification thresholds, which makes it less sensitive to class distribution than accuracy.\u003c/p\u003e \u003cp\u003eThe Matthews Correlation Coefficient (MCC) is another important but less frequently reported metric. It takes true and false positives and negatives into account and is considered a balanced measure even when class sizes differ greatly. MCC is especially relevant in software defect prediction, where the strong skew between defective and clean modules can distort traditional measures like accuracy. To handle similar problems, several studies also apply G-mean, Balanced Accuracy, or the Kappa statistic.\u003c/p\u003e \u003cp\u003eTo verify model performance and assess generalizability, researchers often use k-fold cross-validation, in which the dataset is split into k subsets and the model is trained and tested k times, each time using a different subset as the test set and the remaining subsets as training data. This approach gives a more stable estimate of model performance and lowers the variance associated with a single random train-test split. 
Hold-out validation (e.g., 70/30 or 80/20 splits) is faster and is used in preliminary investigations but, depending on the split ratio and data distribution, is more prone to bias. When datasets are small or highly skewed, stratified k-fold cross-validation and leave-one-out cross-validation (LOOCV) are also applied. Stratification ensures that each fold preserves the ratio of defective to non-defective instances, so that the minority class is fairly represented during training and testing. Some studies also use time-aware validation, particularly in failure forecasting, to replicate real-world deployment, in which models are trained on past data and tested on future data. Comparative baselines are another necessary element of validation. To demonstrate improvement, artificial intelligence-based models are often compared against conventional techniques such as logistic regression or decision trees. Comparisons between families of artificial intelligence models, such as traditional machine learning (e.g., SVM, Random Forest) and deep learning (e.g., CNN, LSTM), are also used to benchmark progress in recent research. Furthermore, ablation experiments are carried out to investigate the effect of particular preprocessing steps or individual features, helping to refine model design. Overall, building trust in AI-based defect prediction tools depends on thorough evaluation with comprehensive metrics and validation techniques. Such evaluation supports not just academic rigor but also practical relevance in real quality assurance and software maintenance pipelines (35, 37\u0026ndash;39).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"7. 
Key Findings and Trends","content":"\u003cp\u003eThe systematic review of AI-based defect prediction methods reveals several emerging trends and findings across software engineering. One notable result is the growing effectiveness of deep learning methods, particularly when applied to rich representations of source code such as abstract syntax trees (ASTs), graph-based structures, or embedded code sequences. Models such as Graph Neural Networks (GNNs) and Long Short-Term Memory (LSTM) networks have shown better performance in capturing both syntactic and semantic patterns associated with defects. Simpler machine learning models such as Random Forests, Support Vector Machines (SVM), and Gradient Boosting Trees nevertheless remain competitive when paired with well-engineered static code metrics or historical commit data. There is a clear trend from standalone code smell detection toward more integrated approaches that link smells with historical bug data, code churn, and change frequency. Studies show that although not all code smells cause problems, some high-severity smells, such as God Class and Feature Envy, are statistically more strongly associated with fault-prone components. Hybrid approaches that combine structural code smells with process metrics (such as the number of changes, developer activity, or issue tracker logs) usually yield more robust predictions. Another important insight is that feature quality matters more than model complexity alone. Many high-performing studies underline the importance of domain-specific features extracted from source code, such as object-oriented metrics (e.g., coupling, cohesion, inheritance depth), and of natural language features derived from comments and commit messages. These features generally outperform generic ones, and selecting them carefully greatly increases prediction accuracy. 
In fact, feature engineering is often found to be the decisive factor in a model's success, especially on smaller datasets where overfitting remains a problem. Furthermore, the review notes a growing need for model interpretability and explainability, especially in industrial settings. Developers are often reluctant to adopt artificial intelligence tools whose predictions are opaque. Recent studies have therefore begun to include visualizations, SHAP values, or attention mechanisms that highlight which code regions or features most influence the model's decision. This increases developer confidence and favors actionable insights over black-box alerts. Finally, cross-project learning and transfer learning are clearly gaining ground in defect prediction. Dataset bias and codebase-specific traits cause many models trained on one project to generalize poorly. Some studies investigate domain adaptation methods and multi-task learning frameworks that use data from several projects to improve generalizability. These methods are especially helpful in real-world situations where labeled defect data is scarce for smaller or newer projects. Taken together, these results highlight the shift from conventional static analysis to AI-enhanced, context-aware, and developer-friendly defect prediction systems, advancing both academic research and practical software quality assurance (34, 40\u0026ndash;44).\u003c/p\u003e"},{"header":"8. Future Research Directions","content":"\u003cp\u003eAlthough defect prediction driven by artificial intelligence has made great strides, several research opportunities remain unexplored. One of the most urgent directions is the development and integration of explainable artificial intelligence (XAI) approaches into defect prediction models. Most high-performance machine learning models, especially deep learning architectures, are black boxes by nature, which limits their acceptance in critical software development processes. 
Developers often need a reasonable justification for a prediction, whether it flags a code segment as defective or not. By offering human-understandable insights into how predictions are generated, XAI techniques can improve trust, accountability, and decision-making, and so close the gap between model performance and developer usability. Transfer learning and domain adaptation across software projects are another promising avenue. Most artificial intelligence models for defect prediction are developed on a particular dataset and struggle to generalize when applied to other software repositories. This limitation stems from variations in code structure, development practices, and project-specific traits. Future research should examine how models trained on one project can be adapted or fine-tuned with minimal data from another. Pretrained language models for source code, such as CodeBERT or GraphCodeBERT, offer promising opportunities for cross-project learning, particularly when combined with fine-tuning. Another important area of development is the integration of AI-based defect prediction into Continuous Integration and Continuous Deployment (CI/CD) pipelines. Most current research treats prediction as a stand-alone process; incorporating predictions into real-time development environments could greatly improve proactive debugging and quality assurance. Future studies could concentrate on lightweight, real-time prediction algorithms that automatically evaluate commits or pull requests and notify developers about possible risks without appreciably increasing pipeline overhead. Future research could also investigate multimodal defect prediction by combining several kinds of input data: source code, commit messages, developer comments, execution logs, and even test case results. 
Such rich and varied data could increase the context-awareness and robustness of defect prediction models. Combining natural language processing with static code analysis, for instance, might find semantic defects or misaligned developer intent more successfully than either approach alone. Finally, human-in-the-loop systems are emerging as a hybrid approach in which artificial intelligence supports rather than replaces human judgment. These systems give developers feedback, let them interact with predictions, and allow them to refine models over time. Such setups could create a continuous learning loop between the artificial intelligence system and its users, help reduce false positives, and enable adaptation to changing codebases. Building feedback mechanisms into predictive tools helps them evolve in line with the software projects they support, producing more accurate and sustainable defect forecasting systems.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eAI-powered defect prediction has become a transformative tool in modern software engineering, meeting the increasing demand for scalable, accurate, and proactive quality assurance methods. Synthesizing 500 peer-reviewed publications spanning more than a decade, this systematic review traces the development of approaches and the growing complexity of predictive models, from traditional algorithms to deep learning and transformer-based architectures. 
While classic models like Random Forest and SVM remain extensively used due to their robustness and interpretability, more recent developments in deep learning, including CNNs, RNNs, and transformer models, offer significant advantages, particularly in handling unstructured data such as commit messages, source code comments, and system logs.\u003c/p\u003e\n\u003cp\u003eThe value of feature selection and representation is a recurring theme throughout the review. Rich, domain-specific features, including code smells, object-oriented metrics, change history, and semantic information, often matter more for model performance than the complexity of the learning method itself. Studies increasingly support hybrid feature sets combining textual data, process metrics, and static code analysis, suggesting that a multimodal approach can more accurately reflect the multifaceted nature of software faults. Furthermore, integrated and longitudinal models, which link early-stage indicators (such as code smells) with mid- and late-stage outcomes (such as bugs and runtime failures), are clearly gaining traction. This lifecycle-aware approach improves the capacity of AI systems to forecast not only isolated defects but also possible cascades of failures, facilitating more strategic allocation of testing and maintenance effort. Nevertheless, several difficulties remain. Dataset bias and differing coding standards still cause many models to generalize poorly across projects. Furthermore, the opacity of complex models, especially deep learning architectures, raises concerns about interpretability, which is essential for developer confidence and acceptance in practical environments. 
The lack of high-quality, consistent, and varied datasets also compromises reproducibility and broad applicability, especially in industrial settings where data is often proprietary.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch3\u003eAuthor Contributions\u003c/h3\u003e\n\u003cp\u003eAll authors reviewed and approved the final version of this manuscript.\u003c/p\u003e\n\u003ch3\u003eEthics Statement\u003c/h3\u003e\n\u003ch3\u003eNo ethical issues declared.\u003c/h3\u003e\n\u003ch3\u003eConflict of Interest\u003c/h3\u003e\n\u003cp\u003eThe authors declare no conflicts of interest relevant to this systematic analysis.\u003c/p\u003e\n\u003ch3\u003eFunding\u003c/h3\u003e\n\u003cp\u003eNo external funding was used to support the development of this manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability Declaration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data supporting the findings of this study are derived from publicly available datasets commonly used in software defect prediction research, including but not limited to the PROMISE repository, NASA MDP datasets, and GitHub-based repositories obtained using tools such as PyDriller and GHTorrent. Detailed information regarding dataset names, domains, and sources is included within the manuscript. No new data were created or proprietary data used for this study. Therefore, data sharing is not applicable to this article. All data utilized in the review are available from the cited references.\u003c/p\u003e"},{"header":" References","content":"\u003col\u003e\n \u003cli\u003eAgrawal A, Menzies T (2018) Is \u0026ldquo;better data\u0026rdquo; better than \u0026ldquo;better data miners\u0026rdquo;?: on the benefits of tuning SMOTE for defect prediction. 
In: Proceedings of the 40th international conference on software engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pp 1050\u0026ndash;1061.\u003c/li\u003e\n \u003cli\u003eAhluwalia A, Falessi D, Di Penta M (2019) Snoring: a noise in defect prediction datasets. In: Storey MD, Adams B, Haiduc S (eds) Proceedings of the 16th international conference on mining software repositories, MSR 2019, 26-27 May 2019. https://doi.org/10.1109/MSR.2019.00019. IEEE / ACM, Canada, pp 63\u0026ndash;67.\u003c/li\u003e\n \u003cli\u003eBird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: Bias in bug-fix datasets. In: Proceedings of the 7th Joint meeting of the european software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE \u0026rsquo;09. https://doi.org/10.1145/1595696.1595716. ACM, New York, pp 121\u0026ndash;130.\u003c/li\u003e\n \u003cli\u003eChen J, Shang W, Shihab E (2020) PerfJIT: Test-level just-in-time prediction for performance regression introducing commits. IEEE Trans Softw Eng :1\u0026ndash;1. https://doi.org/10.1109/TSE.2020.3023955.\u003c/li\u003e\n \u003cli\u003eChen TH, Nagappan M, Shihab E, Hassan AE (2014) An empirical study of dormant bugs. Proceedings of the 11th Working Conference on Mining Software Repositories - MSR 2014. https://doi.org/10.1145/2597073.2597108.\u003c/li\u003e\n \u003cli\u003eChi J, Honda K, Washizaki H, Fukazawa Y, Munakata K, Morita S, Uehara T, Yamamoto R (2017) Defect analysis and prediction by applying the multistage software reliability growth model. In: IWESEP. IEEE Computer Society, pp 7\u0026ndash;11.\u003c/li\u003e\n \u003cli\u003eCogo FR, Oliva GA, Hassan AE (2019) An empirical study of dependency downgrades in the NPM ecosystem. IEEE Trans Softw Eng :1\u0026ndash;1. 
https://doi.org/10.1109/TSE.2019.2952130.\u003c/li\u003e\n \u003cli\u003eDalla Palma S, Di Nucci D, Palomba F, Tamburri DA (2021) Within-project defect prediction of infrastructure-as-code using product and process metrics. IEEE Trans Softw Eng :1\u0026ndash;1. https://doi.org/10.1109/TSE.2021.3051492.\u003c/li\u003e\n \u003cli\u003eDemir\u0026ouml;z G, G\u0026uuml;venir HA (1997) Classification by voting feature intervals. In: ECML, Springer, Lecture Notes in Computer Science, vol 1224, pp 85\u0026ndash;92.\u003c/li\u003e\n \u003cli\u003eFan Y, Xia X, Alencar da Costa D, Lo D, Hassan AE, Li S (2019) The impact of changes mislabeled by SZZ on just-in-time defect prediction. IEEE Trans Softw Eng :1\u0026ndash;1. https://doi.org/10.1109/TSE.2019.2929761.\u003c/li\u003e\n \u003cli\u003eFukushima T, Kamei Y, McIntosh S, Yamashita K, Ubayashi N (2014) An empirical study of just-in-time defect prediction using cross-project models. In: Proceedings of the 11th working conference on mining software repositories, pp 172\u0026ndash;181.\u003c/li\u003e\n \u003cli\u003eGhotra B, McIntosh S, Hassan AE (2017) A large-scale study of the impact of feature selection techniques on defect classification models. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR). IEEE, pp 146\u0026ndash;157.\u003c/li\u003e\n \u003cli\u003eGiger E, D\u0026rsquo;Ambros M, Pinzger M, Gall H (2012) Method-level bug prediction. pp 171\u0026ndash;180. https://doi.org/10.1145/2372251.2372285.\u003c/li\u003e\n \u003cli\u003eHerbold S (2019) On the costs and profit of software defect prediction. arXiv:1911.04309.\u003c/li\u003e\n \u003cli\u003eKamei Y, Shihab E (2016) Defect prediction: Accomplishments and future challenges. In: 2016 IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER), vol 5. IEEE, pp 33\u0026ndash;45.\u003c/li\u003e\n \u003cli\u003eMende T, Koschke R (2010) Effort-aware defect prediction models. 
In: Capilla R, Ferenc R, Due\u0026ntilde;as JC (eds) 14th European conference on software maintenance and reengineering, CSMR 2010, 15-18 March 2010. IEEE Computer Society, Spain, pp 107\u0026ndash;116, DOI https://doi.org/10.1109/CSMR.2010.18, (to appear in print).\u003c/li\u003e\n \u003cli\u003eAmasaki S. Cross-version defect prediction: use historical data, cross-project data, or both? Empir Softw Eng. 2020;25.\u003c/li\u003e\n \u003cli\u003eBangash AA, Sahar H, Hindle A, Ali K. On the time-based conclusion stability of cross-project defect prediction models. Empir Softw Eng. 2020;25.\u003c/li\u003e\n \u003cli\u003eBennin KE, Keung J, Phannachitta P, Monden A, Mensah S. MAHAKIL: diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Trans Software Eng. 2018;44.\u003c/li\u003e\n \u003cli\u003eBennin KE, Keung JW, Monden A. On the relative value of data resampling approaches for software defect prediction. Empir Softw Eng. 2019;24.\u003c/li\u003e\n \u003cli\u003eFalessi D, Ahluwalia A, Penta MD. The impact of dormant defects on defect prediction: A study of 19 apache projects. ACM Trans Softw Eng Methodol (TOSEM). 2021;31.\u003c/li\u003e\n \u003cli\u003eZhao Y, Damevski K, Chen H. A systematic survey of just-in-time software defect prediction. ACM Computing Surveys. 2023;55(10):1-35.\u003c/li\u003e\n \u003cli\u003ePachouly J, Ahirrao S, Kotecha K, Selvachandran G, Abraham A. A systematic literature review on software defect prediction using artificial intelligence: Datasets, Data Validation Methods, Approaches, and Tools. Engineering Applications of Artificial Intelligence. 2022;111:104773.\u003c/li\u003e\n \u003cli\u003eLi Z, Jing X-Y, Zhu X. Progress on approaches to software defect prediction. Iet Software. 2018;12(3):161-75.\u003c/li\u003e\n \u003cli\u003eLi N, Shepperd M, Guo Y. A systematic review of unsupervised learning techniques for software defect prediction. Information and Software Technology. 
2020;122:106287.\u003c/li\u003e\n \u003cli\u003eBatool I, Khan TA. Software fault prediction using data mining, machine learning and deep learning techniques: A systematic literature review. Computers and Electrical Engineering. 2022;100:107886.\u003c/li\u003e\n \u003cli\u003eZhao L, Shang Z, Zhao L, Zhang T, Tang YY. Software defect prediction via cost-sensitive Siamese parallel fully-connected neural networks. Neurocomputing. 2019;352.\u003c/li\u003e\n \u003cli\u003eXia X, Lo D, Pan SJ, Nagappan N, Wang X. Hydra: massively compositional model for cross-project defect prediction. IEEE Trans Softw Eng. 2016;42.\u003c/li\u003e\n \u003cli\u003eTong H, Liu B, Wang S. Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning. Inf Softw Technol. 2018;96.\u003c/li\u003e\n \u003cli\u003eLiang H, Yu Y, Jiang L, Xie Z. Seml: a semantic LSTM model for software defect prediction. IEEE Access. 2019;7.\u003c/li\u003e\n \u003cli\u003eMiholca DL, Czibula G, Czibula IG. A novel approach for software defect prediction through hybridizing gradual relational association rules with artificial neural networks. Inf Sci. 2018;441.\u003c/li\u003e\n \u003cli\u003eManjula C, Florence L. Deep neural network based hybrid approach for software defect prediction using software metrics. Clust Comput. 2019;22.\u003c/li\u003e\n \u003cli\u003eJayanthi R, Florence L. Software defect prediction techniques using metrics based on neural network classifier. Clust Comput. 2019;22.\u003c/li\u003e\n \u003cli\u003eKhan MZ. Hybrid ensemble learning technique for software defect prediction. Int J Modern Educ Comput Sci. 2020;12.\u003c/li\u003e\n \u003cli\u003eMajd A, Vahidi-Asl M, Khalilian A, Poorsarvi-Tehrani P, Haghighi H. SLDeep: statement-level software defect prediction using deep-learning model on static code features. Expert Syst Appl. 2020;147.\u003c/li\u003e\n \u003cli\u003eAkour M, Melhem WY. 
Software defect prediction using genetic programming and neural networks. Int J Open Sour Softw Process. 2017;8.\u003c/li\u003e\n \u003cli\u003eAli MM, Huda S, Abawajy J. A parallel framework for software defect detection and metric selection on cloud computing. Clust Comput. 2017;20.\u003c/li\u003e\n \u003cli\u003eKhleel NAA, Nehéz K. A novel approach for software defect prediction using CNN and GRU based on SMOTE Tomek method. J Intell Inf Syst. 2023;60.\u003c/li\u003e\n \u003cli\u003eKhleel NAA, Nehéz K. A new approach to software defect prediction based on convolutional neural network and bidirectional long short-term memory. Prod Syst Inf Eng. 2022;10.\u003c/li\u003e\n \u003cli\u003eMohammed B, Awan I, Ugail H. Failure prediction using machine learning in a virtualised HPC system and application. Clust Comput. 2019;22.\u003c/li\u003e\n \u003cli\u003eAnbu M, Anandha Mala GS. Feature selection using firefly algorithm in software defect prediction. Clust Comput. 2019;22.\u003c/li\u003e\n \u003cli\u003ePan C, Lu M, Xu B. An improved CNN model for within-project software defect prediction. Appl Sci. 2019;9.\u003c/li\u003e\n \u003cli\u003eChen L, Fang B, Shang Z, Tang Y. Negative samples reduction in cross-company software defects prediction. Inf Softw Technol. 2015;62.\u003c/li\u003e\n \u003cli\u003eFeng S, Keung J, Yu X, Xiao Y, Zhang M. Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction. Inf Softw Technol. 
2021;139.\u003c/li\u003e\n\u003c/ol\u003e"}],"lastPublishedDoi":"10.21203/rs.3.rs-6792823/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6792823/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptTitle":"AI-Powered Defect Prediction: From Code Smells to Failure Forecasting","postedDate":"June 9th, 2025","status":"posted"},"version":"v1","identity":"rs-6792823"}}}