k-Nearest Neighbour machine method for predicting resistance gene against Magnaporthe oryzae in rice using proteomic markers

doi:10.21203/rs.3.rs-4148015/v1

2024 · doi:10.21203/rs.3.rs-4148015/v1

preprint OA: closed

AI-generated deep summary by claude@2026-06, 2026-06-24 · read from full text

This preprint studied how well machine-learning models can distinguish rice genes associated with resistance versus susceptibility to the fungal pathogen Magnaporthe oryzae using protein sequence–derived features. The authors collected protein sequences for 22 blast resistance genes and 18 susceptibility genes from NCBI and UniProt, extracted 20 amino-acid composition features and 400 dipeptide composition features, applied Boruta (random-forest-based) feature selection with an 80/20 train/test split, and then trained five classifiers including k-nearest neighbors (k-NN). They report that k-NN using the Boruta-selected features achieved the best performance, with about 90% accuracy and AUC ≈ 0.90, and they performed functional enrichment using STRING to identify protein-protein interaction enrichment. The paper is explicitly a preprint and not peer reviewed, and it uses a relatively small, literature-derived gene set, with only 3-fold cross-validation. The paper does not explicitly discuss endometriosis or adenomyosis; it was included in the corpus via a keyword match in the upstream search index.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Full text · extracted from preprint-html

Research Article

k-Nearest Neighbour machine method for predicting resistance gene against Magnaporthe oryzae in rice using proteomic markers

Angelina Thomas Villikudathil, Jayachandran K, Radhakrishnan E. K.

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-4148015/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 28 Jul, 2024 in Journal of Proteins and Proteomics. Version 1 posted 8

Abstract

Rice blast disease, caused by the fungal pathogen Magnaporthe oryzae, poses a severe threat to global rice cultivation, impacting over 3.5 billion people and the livelihoods of 200 million.
Because sustainable resistance remains difficult to achieve, our study focuses on identifying proteomic signatures in blast disease-resistant and susceptible genes using amino acid and dipeptide compositions. Leveraging machine learning, particularly a k-NN model, we identified 20 molecular markers that distinguish resistant from susceptible genes with 90% accuracy. This research highlights the potential of protein sequence-based machine learning for predicting blast disease resistance, providing insights for disease-resistant breeding programs and supporting global food security through sustainable rice cultivation.

Keywords: Oryza sativa, Sequence, Amino acid composition, Dipeptide composition, Machine Learning, Boruta

Introduction

Blast disease is a fungal infectious disease of rice (Oryza sativa) and one of the most devastating infections affecting rice cultivation on a global scale [1]. Rice production and food security are essential: rice is the primary source of food for more than 3.5 billion people, and rice production is the primary source of income and employment for more than 200 million households across the globe. Blast disease has been estimated to cause rice yield losses of up to 100%, and durable gene-based resistance is often unavailable because the rapidly evolving fungal pathogen mutates and overcomes resistant rice cultivars [2]. Several blast disease resistance genes in rice have been identified and reported in the literature [3–5], of which 22 genes that confer resistance have been cloned and characterized at the sequence level [6]. Several amino acid composition and dipeptide composition studies have helped to identify and characterize protein sequences [7–9] and to understand their evolutionary information [10].
Protein sequences are composed primarily of 20 amino acids, and dipeptide composition captures several attributes of a protein, such as its location, structure and function [11]. Protein function is determined by protein structure, which in turn is influenced by the sequence [12]. Therefore, changes in amino acid composition can affect the structure of rice plant defence-related proteins involved in fungal pathogen recognition and resistance. Amino acids and dipeptides can regulate the enzymatic activities linked to pathways that defend against pathogens [13], or may act as signalling molecules or motifs in rice plant defence signalling pathways [14]. Machine learning based prediction of disease resistance molecular markers in plants can provide rapid insights into their identification and into plant pathophysiology [15, 16]. Several machine learning-based models have been developed to predict rice blast disease [17–22]. However, to our knowledge, no study to date has built machine learning models on the amino acid and dipeptide compositions of protein sequences from blast disease resistant versus susceptible genes in order to gain insights into rice plant pathophysiology. This study aims to identify novel molecular markers and to provide a machine learning tool for predicting blast disease resistance versus susceptibility from protein sequences. We identified 20 amino acid and dipeptide compositions that can distinguish between blast disease resistant and susceptible gene groups. Cross validation was performed, and the top performing machine learning model, k-NN, produced high classification accuracy, precision, recall and area under the curve, outperforming the other machine learning models built in this study.
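To make the amino acid and dipeptide composition features concrete, the following is a minimal Python sketch of how the 20 + 400 = 420 values could be computed from one sequence. It is not the Protr implementation used by the authors; the function names, the toy sequence and the choice to normalise dipeptides by the number of overlapping pairs (sequence length minus one) are assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Fraction of each standard residue among all residues in seq."""
    seq = seq.upper()
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among the overlapping pairs.

    Assumption: pairs are counted with a sliding window of width 2 and
    normalised by the number of windows (len(seq) - 1).
    """
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    n_pairs = len(seq) - 1
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return {pair: c / n_pairs for pair, c in counts.items()}


def composition_features(seq):
    """Concatenate the 20 amino acid and 400 dipeptide features."""
    features = amino_acid_composition(seq)
    features.update(dipeptide_composition(seq))
    return features


# Hypothetical toy sequence, not taken from the study's data set.
toy_sequence = "MKNMAAGE"
features = composition_features(toy_sequence)
print(len(features), features["A"], features["NM"])
```

Single-letter keys such as `A` and two-letter keys such as `NM` correspond to the naming used for the Boruta-selected features later in the paper.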
Methods

Data collection and sample size of this study: Rice blast disease resistance and susceptibility genes were identified through a literature review of genes reported to be involved in rice blast disease resistance and susceptibility. Protein sequences of 22 rice blast disease resistance genes and 18 disease susceptibility genes were downloaded from the National Centre for Biotechnology Information (NCBI) and UniProt databases. All the protein sequences used in this study are provided in the supplementary information.

Feature extraction: All 20 amino acid composition and 400 dipeptide composition features were extracted from the protein sequences, supplied as FASTA files, using the Protr [23] web-based tool (http://protr.org). The Protr package is also available in the R language. Amino acid composition refers to the proportion of each amino acid type among the total residue count of a protein sequence; dipeptide composition is the corresponding proportion of each ordered pair of adjacent residues.

Boruta feature selection: Feature selection is a preliminary step in machine learning model development that aims to eliminate redundant information and mitigate overfitting. The Boruta algorithm, a Random Forest based wrapper method, was used for feature selection through its R interface [24]. Here, the data set of 420 amino acid and dipeptide composition features was split into an 80% training set and a 20% test set. Variables were scaled using the standard scaler function, and the Boruta algorithm was applied to all the genes of this study.

Machine learning model development and evaluation: The Boruta significant variables were used as input features to develop five ML based models using the following algorithms: Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbour (K-NN), Logistic Regression (LR) and Random Forest (RF). Comparative model performance was assessed using Receiver Operator Characteristic (ROC) curves.
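The modelling step can be illustrated with a small, dependency-free sketch of a k-NN classifier evaluated by stratified three-fold cross validation on standardized features. This is not the authors' code, which used scikit-learn. The value k=3, the fold assignment, the toy data and the choice to fit the scaler on each training fold only are assumptions of this example, since the source does not specify them.

```python
import math
import random
from collections import Counter


def standardize(train, test):
    """Scale columns using the mean and standard deviation of the training rows."""
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

    return scale(train), scale(test)


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_folds(y, n_folds=3, seed=0):
    """Distribute the indices of each class evenly over the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)
    return folds


def cross_validate(X, y, k=3, n_folds=3):
    """Return the held-out accuracy of k-NN for each fold."""
    accuracies = []
    for fold in stratified_folds(y, n_folds):
        held_out = set(fold)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        train_X, test_X = standardize([X[i] for i in train_idx],
                                      [X[i] for i in fold])
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_predict(train_X, train_y, x, k) == y[i]
                   for x, i in zip(test_X, fold))
        accuracies.append(hits / len(fold))
    return accuracies


# Hypothetical, well-separated toy data standing in for selected features.
X = ([[0.1 * i, 0.05 * i] for i in range(9)]
     + [[5 + 0.1 * i, 5 + 0.05 * i] for i in range(9)])
y = ["resistant"] * 9 + ["susceptible"] * 9
fold_accuracies = cross_validate(X, y)
print(fold_accuracies)
```

Averaging `fold_accuracies` and taking their spread gives a mean ± deviation figure of the same form as the classification accuracies reported in Table 1.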
Cross validation was used to enable a robust estimation of the performance of the machine learning models. Three-fold cross validation was performed, and the machine learning models were evaluated using gold standard metrics, which are described in detail in, and adapted from, our previous work [25]. Here, the gold standard refers to the most accurate and reliable method available for distinguishing the resistant from the susceptible gene groups. The ML models were built to assess how well the Boruta significant variables discriminate between the resistant and susceptible gene groups.

Software, packages and libraries: All analyses were performed in Jupyter Notebook [26] using Python [27] version 2.7.16 and R [28] version 4.3.0, with the Python packages numpy, matplotlib, pandas (DataFrame) and math; the sklearn components label_binarize, train_test_split, StandardScaler, SVC, GaussianNB, KNeighborsClassifier, LogisticRegression, RandomForestClassifier, classification_report, accuracy_score, make_scorer, matthews_corrcoef, roc_curve, roc_auc_score and cross_val_score; and the R libraries ranger and Boruta.

Functional enrichment analysis: Protein-protein interaction network analysis was carried out using the STRING online repository version 12.0 (https://string-db.org).

Results

Boruta feature selection: The Boruta algorithm ranked the variables by their Z-score distributions and identified 20 amino acid and dipeptide composition features (NM, MP, HL, LE, IV, IW, G, ED, WE, GF, E, VR, I, HP, KN, CP, A, AA, RY, NN) as significant (Figure 1).

Figure 1. Ranking of amino acid and dipeptide composition variables using the Boruta algorithm, comparing rice blast disease resistant versus susceptible gene groups.
The variables were scaled and split into training and test sets after annotation of the protein sequences, and the Boruta algorithm was applied to compare blast disease resistance genes (n=22) with blast disease susceptibility genes (n=18). The resulting important variables are depicted as Z-score boxplots ranked by the Boruta algorithm, in which green denotes variables confirmed as important. Figure 1 highlights the 20 variables ranked as important by the Boruta algorithm.

High performing ML models useful to predict rice blast disease resistant and susceptibility gene groups: Five ML algorithms (Support Vector Machines, Naïve Bayes, Logistic Regression, k-Nearest Neighbour and Random Forest) were trained and tested for predicting the resistance and susceptibility gene groups of rice blast disease. Table 1 gives a detailed overview of the performance evaluation metrics of the five ML models: Area Under Curve (AUC), Classification Accuracy (CA), Matthews Correlation Coefficient (MCC), precision, recall (sensitivity) and F1 score. When classifying the disease resistant and susceptibility gene groups, the k-Nearest Neighbour model shows the highest classification accuracy, 90.55±8.20%, with a precision of 0.90, a recall of 0.90, an MCC of 0.81 and an AUC of 0.90, outperforming the other ML models in Table 1. The Boruta significant variables from Figure 1 were used to build these ML models. An ROC curve was plotted for the top performing ML algorithm, k-NN, using the significant Boruta features for rice blast disease resistant versus susceptible gene groups. The x-axis in Figure 2 denotes the False Positive Rate (FPR) and the y-axis denotes the True Positive Rate (TPR). The dotted line in the figure represents the ROC curve of a random classification model (random performance).
The legend gives the Area Under Curve (AUC) values obtained with the different ML algorithms, colour coded for differentiation.

Figure 2. Receiver Operator Characteristic (ROC) curves of the best performing machine learning algorithms using the Boruta important variables for rice blast disease resistant versus susceptible gene groups.

Functional enrichment analysis of resistance genes: The protein sequences of the resistance genes have a protein-protein interaction enrichment p-value of 0.000427, and the interacting edges shown between proteins are based on text-mining protein-protein associations (Figure 3). Table 2 outlines the key biological findings, giving the strength and false discovery rate of each category within the protein-protein interaction network. Notably, the Mitochondrial Intermembrane Space Protein Transporter Complex exhibits a substantial strength of 1.71 and a low false discovery rate of 1.51E-06, with six identified genes. In the Extracellular region category, six genes are observed with a strength of 0.9 and a false discovery rate of 0.0469. The GO Function 'ADP binding' involves three genes with a strength of 1.94 and a false discovery rate of 0.0121. Plant-pathogen interaction (KEGG) includes six genes with a strength of 0.97 and a false discovery rate of 0.0035. Plant hormone signal transduction (KEGG) features four genes with a strength of 1.05 and a false discovery rate of 0.0282. In the GO Process category, defense response involves nine genes with a strength of 1.15 and an exceptionally low false discovery rate of 8.15E-06. Response to stimulus (GO Process) involves 13 genes with a strength of 0.65 and a false discovery rate of 8.48E-05. The Reactome pathways Neutrophil Degranulation, Innate Immune System and Immune System have strengths ranging from 0.94 to 1.26, with false discovery rates ranging from 3.84E-06 down to 9.16E-09.
These findings underscore the robustness and significance of the biological processes identified within the protein-protein interaction network of the resistance genes.

Figure 3. Unveiling protein networks in resistance genes through integrated analysis of protein-protein interaction enrichment and text-mining associations.

Functional enrichment analysis of susceptible genes: The protein sequences of the susceptibility genes have a protein-protein interaction enrichment p-value of 0.131, and the interacting edges shown between proteins come from curated databases, experimentally determined structures, text mining and co-expression protein-protein associations (Figure 4). Table 3 outlines the key biological insights. Notable findings include the integral component of the plasma membrane, involving three genes (Xa25, SWEET14, SWEET11) with a strength of 1.77 and an FDR of 0.0154. Sucrose transmembrane transporter activity (SWEET14, SWEET11; strength 2.71) and sugar transmembrane transporter activity (Xa25, SWEET14, SWEET11; strength 1.9) each have an FDR of 0.0168. Sequence-specific DNA binding involves five genes with a strength of 1 and an FDR of 0.0367. PI21 and WRKY45-2 appear in the defense response terms (Defense response: strength 1.12, FDR 2.82E-06; Defense response to other organism: strength 1.32, FDR 0.00021). The cellular hexose transport pathway has a strength of 2.29 and an FDR of 0.00049, involving the Xa25, SWEET14 and SWEET11 genes. These findings summarize the molecular functions and processes within the studied network, together with the strength and significance of the associated genes.

Figure 4. Unveiling protein networks in susceptible genes through integrated analysis of protein-protein interaction enrichment associations.

Table 1. Evaluation metrics of the ML models built for rice blast disease resistant versus susceptible gene groups. Results are the average over the 3-fold cross validation. The top performing ML model and its metrics are highlighted for comparison.
SVM denotes Support Vector Machines, NB Naive Bayes, K-NN K-Nearest Neighbour, LR Logistic Regression and RF Random Forest.

Comparison type: Rice blast disease resistant versus susceptible gene groups

ML Model | Area Under Curve (AUC) | Classification Accuracy (CA) | Matthews Correlation Coefficient (MCC) | Precision | Recall | F1 score
SVM | 0.79 | 81.21±0.85% | 0.65 | 0.86 | 0.80 | 0.79
NB | 0.81 | 81.66±13.12% | 0.63 | 0.81 | 0.81 | 0.81
K-NN | 0.90 | 90.55±8.20% | 0.81 | 0.90 | 0.90 | 0.90
LR | 0.84 | 84.44±12.27% | 0.68 | 0.84 | 0.84 | 0.84
RF | 0.84 | 84.44±4.15% | 0.69 | 0.84 | 0.84 | 0.84

Table 2. Functional enrichment analysis results for protein sequences of resistance genes of rice blast disease.

Category | Term ID | Term description | Observed gene count | Strength | False discovery rate | Matching proteins in the network (labels)
Compartments | GOCC:0042719 | Mitochondrial intermembrane space protein transporter complex | 6 | 1.71 | 1.51E-06 | OsI_06343,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_37989
GO Component | GO:0005576 | Extracellular region | 6 | 0.9 | 0.0469 | OsI_06343,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_37989
GO Function | GO:0043531 | ADP binding | 3 | 1.94 | 0.0121 | Pit,OsI_06343,OsI_35589
KEGG | map04626 | Plant-pathogen interaction | 6 | 0.97 | 0.0035 | OsI_06343,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_37989
KEGG | map04075 | Plant hormone signal transduction | 4 | 1.05 | 0.0282 | Pit,OsI_03989,OsI_16426,OsI_30909
GO Process | GO:0006952 | Defense response | 9 | 1.15 | 8.15E-06 | Pit,OsI_06343,PI21,OsI_22582,Pid3,OsI_35589,OsI_22584,OsI_27974,OsI_37989
GO Process | GO:0050896 | Response to stimulus | 13 | 0.65 | 8.48E-05 | Pit,OsI_00604,OsI_03989,OsI_06343,PI21,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_22584,OsI_27974,OsI_30909,OsI_37989
Reactome | MAP-6798695 | Neutrophil degranulation | 10 | 1.26 | 9.16E-09 | Pit,OsI_03989,OsI_06343,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_30909,OsI_37989
Reactome | MAP-168249 | Innate Immune System | 10 | 1.15 | 5.98E-08 | Pit,OsI_03989,OsI_06343,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_30909,OsI_37989
Reactome | MAP-168256 | Immune System | 10 | 0.94 | 3.84E-06 | Pit,OsI_03989,OsI_06343,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_30909,OsI_37989

Table 3. Functional enrichment analysis results for protein sequences of susceptible genes of rice blast disease.

Category | Term ID | Term description | Observed gene count | Strength | False discovery rate | Matching proteins in the network (labels)
GO Component | GO:0005887 | Integral component of plasma membrane | 3 | 1.77 | 0.0154 | Xa25,SWEET14,SWEET11
GO Function | GO:0008515 | Sucrose transmembrane transporter activity | 2 | 2.71 | 0.0168 | SWEET14,SWEET11
GO Function | GO:0051119 | Sugar transmembrane transporter activity | 3 | 1.9 | 0.0168 | Xa25,SWEET14,SWEET11
GO Function | GO:0043565 | Sequence-specific DNA binding | 5 | 1 | 0.0367 | OsI_04690,OsI_04142,OsI_16171,WRKY45-2,OsI_23702
GO Process | GO:0006952 | Defense response | 10 | 1.12 | 2.82E-06 | PI21,OsI_04142,WRKY45-2,OsI_19711,OsI_21727,OsI_23702,OsI_30063,OsI_32658,SWEET11,NPR1
GO Process | GO:0006950 | Response to stress | 12 | 0.79 | 5.92E-05 | OsI_04690,PI21,OsI_34937,OsI_04142,WRKY45-2,OsI_19711,OsI_21727,OsI_23702,OsI_30063,OsI_32658,SWEET11,NPR1
GO Process | GO:0051707 | Response to other organism | 7 | 1.19 | 0.00021 | PI21,WRKY45-2,OsI_19711,OsI_30063,OsI_32658,SWEET11,NPR1
GO Process | GO:0098542 | Defense response to other organism | 6 | 1.32 | 0.00021 | PI21,WRKY45-2,OsI_19711,OsI_30063,SWEET11,NPR1
Reactome | MAP-189200 | Cellular hexose transport | 3 | 2.29 | 0.00049 | Xa25,SWEET14,SWEET11

Discussion

Studying the response status of rice blast disease through protein sequence analysis holds great promise for understanding the molecular basis of resistance, developing diagnostic tools, and designing new disease control strategies. Molecular markers are specific DNA sequences or variations associated with a particular trait or phenotype, in this case resistance or susceptibility to rice blast. If particular amino acids or dipeptides are found to form a signature that separates resistance genes from susceptibility genes of rice blast disease, they can likewise be considered molecular markers.
The goal is to develop broadly applicable markers; incorporating data from diverse rice varieties can enhance the generalizability of the identified markers and their potential for application in breeding programs. Large scale production of commercial crops has deteriorated owing to several biotic stress factors, one of which is fungal infection. Early identification of disease resistance predictors plays a major role in crop improvement. Resistance genes, also termed R genes, provide resistance to pathogens through the R proteins they encode. These R proteins play a major role in plant breeding and pathology programs. The prediction of these genes is important as they trigger the defense system and inhibit the growth of the pathogen [15]. The present results of the ML predictive models are significant in at least two major respects. Firstly, the ML model built using the k-Nearest Neighbour algorithm has a prediction accuracy of 90% and an area under the ROC curve of 0.90 using the important Boruta ranked variables from rice blast disease resistant versus susceptible genes. This approach may be used in rice crop breeding programs to predict blast disease resistance or susceptibility status, potentially earlier than is presently possible. The ML models built using the important Boruta ranked variables may be improved further by experimenting with other feature selection methods. Secondly, the identified important Boruta ranked variables can aid knowledge discovery and can be studied further in research settings. To our knowledge, ours is the first study to employ traditional ML algorithms to predict rice blast disease resistant versus susceptible genes from the protein sequences they encode.
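For readers less familiar with the evaluation metrics reported in Table 1, the sketch below shows how accuracy, precision, recall, F1 score and the Matthews correlation coefficient follow from the four cells of a binary confusion matrix. The labels and predictions are hypothetical and do not reproduce the study's results; treating "resistant" as the positive class is also an assumption of this example.

```python
import math


def binary_metrics(y_true, y_pred, positive="resistant"):
    """Compute common binary classification metrics from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall (sensitivity): share of true positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC uses all four cells and ranges from -1 to 1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical held-out fold: one susceptible gene is predicted as resistant.
y_true = ["resistant"] * 4 + ["susceptible"] * 4
y_pred = ["resistant"] * 5 + ["susceptible"] * 3
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With only a handful of genes per held-out fold, a single misclassification shifts each metric substantially, which is consistent with the wide ± ranges on classification accuracy in Table 1.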
The functional enrichment analysis of the resistance genes of rice blast disease gives insight into the molecular functions of the studied system. Notably, the Mitochondrial Intermembrane Space Protein Transporter Complex, identified with a strength of 1.71 and a low false discovery rate (FDR) of 1.51E-06, involves six associated genes. Genes related to the extracellular region are also enriched (strength: 0.9, FDR: 0.0469), pointing to potential roles. Key GO Functions, such as ADP binding (strength: 1.94, FDR: 0.0121), and KEGG pathways, including plant-pathogen interaction (strength: 0.97, FDR: 0.0035) and plant hormone signal transduction (strength: 1.05, FDR: 0.0282), offer insights into crucial molecular functions. The analysis also sheds light on the system's defense mechanisms, with the defense response GO Process featuring nine genes (strength: 1.15, FDR: 8.15E-06). Furthermore, the Reactome pathways Neutrophil Degranulation, Innate Immune System and Immune System each include 10 genes, with strengths from 0.94 to 1.26 and low FDRs (3.84E-06 to 9.16E-09). For the susceptibility genes of rice blast disease, the functional enrichment analysis indicates that the genes Xa25, SWEET14 and SWEET11 play roles in the integral component of the plasma membrane, sugar transport and cellular hexose transport, with strengths ranging from 1.77 to 2.71 and low FDRs. Additionally, the genes PI21 and WRKY45-2 are implicated in defense responses, with varying strengths and consistently low FDRs. Overall, these findings identify key molecular components and functional aspects, providing a foundation for deeper exploration of the system's biology, including energy production, cellular functions and responses to external threats.
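The "strength" values quoted above are easier to interpret with a short worked example. The source does not define the measure; the sketch below assumes the definition given in STRING's own description of its enrichment output, namely the base-10 logarithm of the observed over the expected number of annotated proteins. The function names are hypothetical.

```python
import math


def enrichment_strength(observed, expected):
    """Strength as log10(observed / expected), assuming STRING's definition."""
    return math.log10(observed / expected)


def implied_expected(observed, strength):
    """Expected count implied by an observed count and a reported strength."""
    return observed / 10 ** strength


# Table 2 reports 6 observed genes and a strength of 1.71 for the
# mitochondrial intermembrane space protein transporter complex term.
expected = implied_expected(6, 1.71)
fold_enrichment = 6 / expected
print(round(expected, 3), round(fold_enrichment, 1))
```

Under this assumption, a strength of 1.71 corresponds to roughly a 51-fold enrichment over what a random network of the same size would be expected to contain, while a strength of 0.65 corresponds to roughly a 4.5-fold enrichment.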
Limitations

The study presented herein is a step toward understanding blast disease resistance in rice through the identification of molecular markers using protein sequence-based machine learning. However, several limitations should be considered in interpreting the findings. Firstly, the study's reliance on available protein sequences of blast disease-resistant and susceptible genes in rice may be constrained by the quality and diversity of these datasets, potentially impacting the generalizability of the identified markers and machine learning model performance. Secondly, the complexity of biological mechanisms underlying blast disease resistance might not be fully captured solely through protein sequences, overlooking other genetic, epigenetic, or environmental factors contributing to resistance. Moreover, while the k-NN machine learning model demonstrated promising performance, its applicability to different rice varieties or varying environmental conditions remains uncertain, necessitating external validation with diverse datasets. Additionally, the functional significance of the 20 identified molecular markers requires further experimental validation to elucidate their biological roles in blast disease resistance. Furthermore, the study might overlook the evolutionary dynamics of the blast pathogen and the potential impacts of genetic variations within resistant and susceptible gene groups, potentially limiting the long-term effectiveness of the identified markers. Acknowledging and addressing these limitations through expanded datasets, diverse model validations, functional studies, and consideration of broader biological aspects will strengthen the reliability and impact of the study's findings in enhancing our understanding of blast disease resistance in rice.
Conclusion

In conclusion, this study represents a step towards uncovering molecular markers for blast disease resistance in rice, employing a protein sequence-based machine learning approach. While our findings are promising and offer insights into potential markers, several limitations underscore the need for further investigation. The reliance on available datasets of blast disease-resistant and susceptible genes in rice poses a challenge in ensuring comprehensive coverage and diversity, potentially impacting the generalizability of the identified markers. Moreover, the intricate biological mechanisms underlying blast disease resistance demand a more holistic exploration beyond protein sequences, considering additional genetic, epigenetic, and environmental factors. Although the k-NN machine learning model exhibited favourable performance, its adaptability to varying rice varieties and environmental conditions warrants scrutiny through robust external validations. Furthermore, the functional characterization of the identified molecular markers remains a critical avenue for elucidating their roles in blast disease resistance. Addressing these limitations through expanded datasets, multidimensional investigations, and functional validations will improve the reliability and applicability of our findings, contributing to the understanding and enhancement of blast disease resistance in rice cultivation.

Declarations

Authors' Contributions: Conceptualization: Angelina Thomas Villikudathil, Radhakrishnan E. K.; Methodology: Angelina Thomas Villikudathil; Formal analysis and data curation: Angelina Thomas Villikudathil; Supervision: Jayachandran K and Radhakrishnan E. K.; Writing, review and editing: Angelina Thomas Villikudathil, Jayachandran K and Radhakrishnan E. K.; Funding acquisition: Radhakrishnan E. K. and Jayachandran K.
Funding: This project was financially supported by Rashtriya Uchchatar Shiksha Abhiyan (RUSA) 2.0 Major Research Project, SC/ST Cell Number 7134/SC/ST Cell/2023/MGU, Priyadarshini Hills, Dated: 26.06.2023.

Competing Interests: The authors declare that they have no competing interests.

Availability of data and materials: All the data and materials of this study will be provided on reasonable request.

Ethical Approval: There were no ethical requirements to perform this study.

Consent to Participate: No participants were involved in this study.

Consent to Publish: All authors have reviewed the final version of the manuscript and give consent to publish the manuscript.

References

1. Ning X, Yunyu W, Aihong L (2020) Strategy for Use of Rice Blast Resistance Genes in Rice Molecular Breeding
2. Asibi AE, Chai Q, Coulter JA (2019) Rice blast: A disease with implications for global food security
3. Gavhane DB, Kulwal PL, Kumbhar SD, Jadhav AS, Sarawate CD (2019) Cataloguing of blast resistance genes in landraces and breeding lines of rice from India. J Genet 98. https://doi.org/10.1007/s12041-019-1148-4
4. Sekhwal MK, Li P, Lam I, Wang X, Cloutier S, You FM (2015) Disease resistance gene analogs (RGAs) in plants
5. Yadav MK, Aravindan S, Ngangkham U, Raghu S, Prabhukarthikeyan SR, Keerthana U, Marndi BC, Adak T, Munda S, Deshmukh R, Pramesh D, Samantaray S, Rath PC (2019) Blast resistance in Indian rice landraces: Genetic dissection by gene specific markers. PLoS ONE 14. https://doi.org/10.1371/journal.pone.0211061
6. Shikari AB, Rajashekara H, Khanna A, Gopala Krishnan S, Rathour R, Singh UD, Sharma TR, Prabhu KV, Singh AK (2014) Identification and validation of rice blast resistance genes in Indian rice germplasm. Indian J Genet Plant Breed 74:286–299. https://doi.org/10.5958/0975-6906.2014.00846.3
7. Lv Z, Jin S, Ding H, Zou Q (2019) A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features. Front Bioeng Biotechnol 7. https://doi.org/10.3389/fbioe.2019.00215
8. Xia J, Hu X, Shi F, Niu X, Zhang C (2010) Support vector machine method on predicting resistance gene against Xanthomonas oryzae pv. oryzae in rice. Expert Syst Appl 37:5946–5950. https://doi.org/10.1016/j.eswa.2010.02.010
9. Lobiyal Durga DK, Mohapatra P, Nagar A, Sahoo MN Proceedings of the International Conference on Signal, Networks, Computing, and Systems. Springer
10. Kaundal R, Raghava GPS (2009) RSLpred: An integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information. Proteomics 9:2324–2342. https://doi.org/10.1002/pmic.200700597
11. Kaundal R, Sahu SS, Verma R, Weirick T (2013) Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning. BMC Bioinformatics 14. https://doi.org/10.1186/1471-2105-14-S14-S7
12. Wan X, Tan X (2019) A study on separation of the protein structural types in amino acid sequence feature spaces. PLoS ONE 14. https://doi.org/10.1371/journal.pone.0226768
13. Prasannath K (2017) Plant defense-related enzymes against pathogens: a review. AGRIEAST: J Agricultural Sci 11:38. https://doi.org/10.4038/agrieast.v11i1.33
14. Kumar J, Ramlal A, Kumar K, Rani A, Mishra V (2021) Signaling pathways and downstream effectors of host innate immunity in plants
15. Pal T, Jaiswal V, Chauhan RS (2016) DRPPP: A machine learning based tool for prediction of disease resistance proteins in plants. Comput Biol Med 78:42–48. https://doi.org/10.1016/j.compbiomed.2016.09.008
16. Saragih GS, Rustam Z (2018) Support Vector Machine with Fisher Score Feature Selection to Predict Disease-Resistant Gene in Rice. In: Journal of Physics: Conference Series. Institute of Physics Publishing
17. Kaundal R, Kapoor AA, Raghava GPS (2006) Machine learning techniques in disease forecasting: A case study on rice blast prediction. BMC Bioinformatics 7. https://doi.org/10.1186/1471-2105-7-485
18. Shaik R, Ramakrishna W (2014) Machine learning approaches distinguish multiple stress conditions using stress-responsive genes and identify candidate genes for broad resistance in rice. Plant Physiol 164:481–495. https://doi.org/10.1104/pp.113.225862
19. Daniya T, Vigneshwari DS, Scholar R (2019) A Review on Machine Learning Techniques for Rice Plant Disease Detection in Agricultural Research. Int J Adv Sci Technol 28:49–62
20. Ramesh S, Vydeki D (2019) Application of machine learning in detection of blast disease in south indian rice crops. J Phytology 11:31–37. https://doi.org/10.25081/jp.2019.v11.5476
21. Nettleton DF, Katsantonis D, Kalaitzidis A, Sarafijanovic-Djukic N, Puigdollers P, Confalonieri R (2019) Predicting rice blast disease: Machine learning versus process-based models. BMC Bioinformatics 20. https://doi.org/10.1186/s12859-019-3065-1
22. Hsieh J-Y, Huang W, Yang H-T, Lin C-C, Fan Y-C, Chen H (2019) Building the Rice Blast Disease Prediction Model based on Machine Learning and Neural Networks
23. Xiao N, Cao DS, Zhu MF, Xu QS (2015) Protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics. Oxford University Press, pp 1857–1859
24. Kursa MB, Rudnicki WR (2010) Feature selection with the Boruta package. J Stat Softw 36. https://doi.org/10.18637/jss.v036.i11
25. Rainey C, Villikudathil AT, McConnell J, Hughes C, Bond R, McFadden S (2023) An experimental machine learning study investigating the decision-making process of students and qualified radiographers when interpreting radiographic images. PLOS Digit Health 2:e0000229. https://doi.org/10.1371/journal.pdig.0000229
26. Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, Kelley K, Hamrick J, Grout J, Corlay S, Ivanov P, Avila D, Abdalla S, Willing C (2016) Jupyter Notebooks-a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas - Proceedings of the 20th International Conference on Electronic Publishing, ELPUB 2016. 87–90. https://doi.org/10.3233/978-1-61499-649-1-87
27. Menczer F, Fortunato S, Davis CA (2020) Python Tutorial. A First Course in Network Science. 221–237. https://doi.org/10.1017/9781108653947.010
28. Braun WJ, Murdoch DJ (2007) A First Course in Statistical Programming with R. Cambridge University Press

Additional Declarations: No competing interests reported.

Supplementary Files: SupplementaryInformation.docx

Version 1 posted
Editorial decision: Revision requested, 13 Jun, 2024
Reviews received at journal, 10 Jun, 2024
Reviewers agreed at journal, 29 May, 2024
Reviewers agreed at journal, 22 May, 2024
Reviewers invited by journal, 20 May, 2024
Editor assigned by journal, 31 Mar, 2024
Submission checks completed at journal, 22 Mar, 2024
First submitted to journal, 22 Mar, 2024
{"props":{"pageProps":{"initialData":{"identity":"rs-4148015","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":282700949,"identity":"7101b802-70bd-458c-8658-75aa80e171f3","order_by":0,"name":"Angelina Thomas Villikudathil","email":"","orcid":"","institution":"Mahatma Gandhi University","correspondingAuthor":false,"prefix":"","firstName":"Angelina","middleName":"Thomas","lastName":"Villikudathil","suffix":""},{"id":282700950,"identity":"278ec6c4-2dde-44e4-959d-f2e2813b6b57","order_by":1,"name":"Jayachandran K","email":"","orcid":"","institution":"Mahatma Gandhi University","correspondingAuthor":false,"prefix":"","firstName":"Jayachandran","middleName":"","lastName":"K","suffix":""},{"id":282700952,"identity":"f456fa5f-3bed-483c-9551-af21d543d581","order_by":2,"name":"Radhakrishnan E. 
K.","email":"","orcid":"","institution":"Mahatma Gandhi University","correspondingAuthor":true,"prefix":"","firstName":"Radhakrishnan","middleName":"E.","lastName":"K.","suffix":""}],"badges":[],"createdAt":"2024-03-22 07:57:28","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4148015/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4148015/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s42485-024-00159-3","type":"published","date":"2024-07-29T00:00:00+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":53505323,"identity":"1863da9b-f08e-428e-9235-cfffbd5f7497","added_by":"auto","created_at":"2024-03-26 19:45:41","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":237537,"visible":true,"origin":"","legend":"\u003cp\u003eRanking of amino acid and dipeptide composition variables using Boruta algorithm upon comparing rice blast disease resistant versus susceptible genes groups.\u003cstrong\u003e \u003c/strong\u003eThe variables were scaled, split based on training and testing of protein sequences after annotation, and Boruta algorithm was applied comparing blast disease resistant genes (n=22) and blast disease susceptible genes (n=18) for this study. The resulting important variables are depicted as Z-score boxplots ranked by the Boruta algorithm wherein green colour denotes passed important variables. 
Here, Figure 1 highlights 20 variables as important variables ranked by the Boruta algorithm.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-4148015/v1/96f76adae437f76eff7f452a.png"},{"id":53505324,"identity":"d9332791-57bd-4de7-bc38-e84c9667f390","added_by":"auto","created_at":"2024-03-26 19:45:41","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":167872,"visible":true,"origin":"","legend":"\u003cp\u003eReceiver Operator Characteristics (ROC) curves of the best performing machine learning algorithms using the Boruta important variables between rice blast disease resistant versus susceptible genes groups.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4148015/v1/6c6b9d9cc728eaa477ae1425.png"},{"id":53505321,"identity":"93c5aba5-386c-4f92-a804-277230685aa9","added_by":"auto","created_at":"2024-03-26 19:45:41","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":130703,"visible":true,"origin":"","legend":"\u003cp\u003eUnveiling protein networks in resistance genes through integrated analysis of Protein-Protein interaction enrichment and text mining associations.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-4148015/v1/ed1d8a3b869c5a67c345134c.png"},{"id":53505322,"identity":"ed1f874a-4475-42fd-a756-28b26adf43ea","added_by":"auto","created_at":"2024-03-26 19:45:41","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":166389,"visible":true,"origin":"","legend":"\u003cp\u003eUnveiling protein networks in susceptible genes through integrated analysis of Protein-Protein interaction enrichment 
associations.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-4148015/v1/4043827a60825c762d2f71de.png"},{"id":71062200,"identity":"789e35c2-079c-40a2-9eb2-29ae55cbae95","added_by":"auto","created_at":"2024-12-10 17:43:54","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1157789,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4148015/v1/6a8814a3-16c3-4997-80f2-06f3dc694986.pdf"},{"id":53505320,"identity":"faaab056-1d4c-4a16-86a9-4f836a3c07d8","added_by":"auto","created_at":"2024-03-26 19:45:41","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":33609,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryInformation.docx","url":"https://assets-eu.researchsquare.com/files/rs-4148015/v1/5ab7605d6b2089b496f261d8.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"k-Nearest Neighbour machine method for predicting resistance gene against Magnaporthe oryzae in rice using proteomic markers","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe blast disease is a fungal infectious disease of rice (\u003cem\u003eOryza sativa\u003c/em\u003e) that stands as one of the most devastating infections that significantly impacts rice cultivation on a global scale [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Rice food security and production is essential as it is the primary source of food to more than 3.5\u0026nbsp;billion people and rice production being the primary source of income and employment for more than 200\u0026nbsp;million households across the globe. 
Blast disease has been estimated to cause rice yield losses of up to 100%, and durable genetic resistance is often unavailable because the fungal pathogen evolves rapidly, mutating to overcome resistant rice cultivars [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Several blast disease resistance genes in rice have been identified and reported in the literature [\u003cspan additionalcitationids=\"CR4\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], of which 22 genes that confer resistance have been cloned and characterized at the sequence level [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eSeveral studies have used amino acid composition and dipeptide composition to identify and characterize protein sequences [\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] and to understand their evolutionary information [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. Protein sequences are composed primarily of 20 amino acids, and dipeptide composition captures several attributes of proteins, such as location, structure and function [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Protein function is determined by protein structure, which in turn is influenced by sequence [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. Therefore, changes in amino acid composition can affect the structure of rice defence-related proteins involved in fungal pathogen recognition and resistance. 
Amino acids and dipeptides can regulate enzymatic activities linked to pathogen defence pathways [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], or may act as signalling molecules or motifs in rice defence signalling pathways [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eMachine learning-based prediction of disease resistance molecular markers in plants can provide rapid insights into their identification and into plant pathophysiology [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Several machine learning-based models have been developed to predict rice blast disease [\u003cspan additionalcitationids=\"CR18 CR19 CR20 CR21\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. However, to the best of our knowledge, no study to date has rigorously built machine learning models on the amino acid and dipeptide compositions of protein sequences from blast disease resistant versus susceptible genes to provide novel insights into rice plant pathophysiology. This study aims to identify novel molecular markers and to provide a novel machine learning tool that predicts blast disease resistance versus susceptibility from protein sequences. We identified 20 amino acid and dipeptide composition features that can distinguish between the blast disease resistant and susceptible gene groups. 
Cross validation was performed, and the top performing machine learning model, k-NN, achieved high classification accuracy, precision, recall and area under the curve, outperforming the other machine learning models built in this study.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e \u003cstrong\u003eData collection and sample size of this study\u003c/strong\u003e \u003cp\u003eRice blast disease resistance and susceptibility genes were identified through a literature review of genes reported to be involved in rice blast disease resistance and susceptibility. Protein sequences of 22 rice blast disease resistant genes and 18 disease susceptibility genes were downloaded from the National Center for Biotechnology Information (NCBI) and UniProt databases. All protein sequences used in this study are provided in the supplementary information.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eFeature extraction\u003c/b\u003e: All 20 amino acid composition and 400 dipeptide composition features were extracted from the FASTA files of the protein sequences using the Protr [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e] web-based tool (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://protr.org\u003c/span\u003e\u003cspan address=\"http://protr.org\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). The Protr package is also available for the R language. Amino acid composition is the proportion of each amino acid type among all residues in a protein sequence; dipeptide composition is the analogous proportion of each of the 400 ordered pairs of adjacent amino acids among all dipeptides in the sequence.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eBoruta feature selection\u003c/strong\u003e \u003cp\u003eFeature selection is a preliminary step in machine learning model development that aims to eliminate redundant information and mitigate overfitting while building the machine learning models. 
The Boruta algorithm, a Random Forest-based wrapper method implemented in R, was used for feature selection [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. Here, the data set described by all 420 amino acid and dipeptide composition features was split into an 80% training set and a 20% test set. The variables were scaled with a standard scaler function, and the Boruta algorithm was applied to all the genes of this study.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eMachine learning model development and evaluation\u003c/b\u003e: The Boruta significant variables were used as input features to develop five ML-based models using the following algorithms: Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbour (K-NN), Logistic Regression (LR) and Random Forest (RF). The comparative model performance was assessed by Receiver Operating Characteristic (ROC) curves. Here, the gold standard refers to the most accurate and reliable reference available for distinguishing blast disease resistant from susceptible genes. The ML models were built to assess how well the Boruta significant variables separate the resistant and susceptible gene groups. Cross validation was used to enable a robust estimation of the performance of the machine learning models. 
Three-fold cross validation was performed, and the machine learning models were evaluated using gold standard metrics, which are described in detail in our previous work [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e] from which they were adapted.\u003c/p\u003e \u003cp\u003e \u003cb\u003eSoftware, packages and libraries\u003c/b\u003e: All analyses were performed in Jupyter Notebook [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e] using Python [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e] version 2.7.16 and R [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] version 4.3.0, with the Python modules numpy, matplotlib, pandas (DataFrame) and math; the sklearn components label_binarize, train_test_split, StandardScaler, SVC, GaussianNB, KNeighborsClassifier, LogisticRegression, RandomForestClassifier, classification_report, accuracy_score, make_scorer, matthews_corrcoef, roc_curve, roc_auc_score and cross_val_score; and the R libraries ranger and Boruta.\u003c/p\u003e \u003cp\u003e \u003cb\u003eFunctional enrichment analysis\u003c/b\u003e: Protein-protein interaction network analysis was carried out using the STRING online repository, version 12.0 (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://string-db.org\u003c/span\u003e\u003cspan address=\"https://string-db.org\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eBoruta feature selection:\u003c/strong\u003e The Boruta algorithm ranked the Z-score distributions (shown as boxplots) and identified 20 amino acid and dipeptide composition features as significant: NM, MP, HL, LE, IV, IW, G, ED, WE, GF, E, VR, I, HP, KN, CP, A, AA, RY and NN (Figure 1).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFigure 
1.\u003c/strong\u003e Ranking of amino acid and dipeptide composition variables using Boruta algorithm upon comparing rice blast disease resistant versus susceptible genes groups.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003eThe variables were scaled, split based on training and testing of protein sequences after annotation, and Boruta algorithm was applied comparing blast disease resistant genes (n=22) and blast disease susceptible genes (n=18) for this study. The resulting important variables are depicted as Z-score boxplots ranked by the Boruta algorithm wherein green colour denotes passed important variables. Here, Figure 1 highlights 20 variables as important variables ranked by the Boruta algorithm.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHigh performing ML models useful to predict rice blast disease resistant and susceptible gene groups\u003c/strong\u003e: Five ML algorithms (Support Vector Machines, Na\u0026iuml;ve Bayes, Logistic Regression, k-Nearest Neighbour and Random Forest) were trained and tested to predict the rice blast disease resistant and susceptible gene groups. Table 1 gives a detailed overview of the performance evaluation metrics of the five ML models built to predict these groups: Area Under Curve (AUC), Classification Accuracy (CA), Matthews Correlation Coefficient (MCC), precision (positive predictive value), recall (sensitivity) and F1 score. When classifying the disease resistant and susceptible gene groups, the k-Nearest Neighbour model showed the highest classification accuracy (90.55\u0026plusmn;8.20%), with a precision of 0.90, a recall of 0.90, an MCC of 0.81 and an AUC of 0.90, outperforming the other ML models in Table 1. 
The Boruta significant variables from Figure 1 were used to build these ML models.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe ROC curve of the top performing ML algorithm, k-NN, was generated using the significant Boruta features for the rice blast disease resistant versus susceptible gene groups. The x-axis in Figure 2 denotes the False Positive Rate (FPR) and the y-axis the True Positive Rate (TPR). The dotted line in the figure represents the ROC curve of a random classification model (random performance). The legend gives the Area Under Curve (AUC) values obtained with the different ML algorithms, colour coded for differentiation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFigure 2.\u003c/strong\u003e Receiver Operator Characteristics (ROC) curves of the best performing machine learning algorithms using the Boruta important variables between rice blast disease resistant versus susceptible genes groups.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunctional enrichment analysis of resistance genes:\u0026nbsp;\u003c/strong\u003eThe protein sequences of the resistance genes have a protein-protein interaction enrichment p-value of 0.000427, and the proteins shown with interacting edges have text-mining protein-protein associations (Figure 3). Table 2 outlines key biological findings, giving the strength and false discovery rate across diverse categories within the protein-protein interaction networks. Notably, the Mitochondrial Intermembrane Space Protein Transporter Complex exhibits a substantial strength of 1.71 and a low false discovery rate of 1.51E-06, with six identified genes. In the Extracellular region category, six genes are observed with a strength of 0.9 and a false discovery rate of 0.0469. The GO Function \u0026lsquo;ADP binding\u0026rsquo; involves three genes with a strength of 1.94 and a false discovery rate of 0.0121. 
Plant-pathogen interaction (KEGG) includes six genes with a strength of 0.97 and a false discovery rate of 0.0035. Plant hormone signal transduction (KEGG) features four genes with a strength of 1.05 and a false discovery rate of 0.0282. In the GO Process category, defense response engages nine genes with a strength of 1.15 and an exceptionally low false discovery rate of 8.15E-06. Response to stimulus (GO Process) involves 13 genes with a strength of 0.65 and a false discovery rate of 8.48E-05. Reactome pathways, including Neutrophil Degranulation, Innate Immune System, and Immune System, exhibit strengths ranging from 0.94 to 1.26, with false discovery rates spanning from 3.84E-06 to 9.16E-09. These findings underscore the robustness and significance of the identified biological processes within the resistance genes protein-protein interaction network.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFigure 3.\u003c/strong\u003e Unveiling protein networks in resistance genes through integrated analysis of Protein-Protein interaction enrichment and text mining associations.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunctional enrichment analysis of susceptible genes:\u003c/strong\u003e The protein sequences of the susceptible genes have a protein-protein interaction enrichment p-value of 0.131, and the proteins shown with interacting edges have protein-protein associations drawn from curated databases, experimentally determined structures, text mining and co-expression (Figure 4). Table 3 outlines key biological insights, with notable findings including an integral component of the plasma membrane involving three genes (Xa25, SWEET14, SWEET11) with a strength of 1.77 and an FDR of 0.0154. Sucrose and sugar transmembrane transporter activities feature SWEET14 and SWEET11 genes, exhibiting strengths of 2.71 and 1.9, respectively, and an FDR of 0.0168. Sequence-specific DNA binding involves five genes with a strength of 1 and an FDR of 0.0367. 
In defense responses, key genes like PI21 and WRKY45-2 demonstrate varying strengths and low FDRs. The cellular hexose transport pathway shows a strength of 2.29 and an FDR of 0.00049, involving the Xa25, SWEET14, and SWEET11 genes. These findings highlight the molecular functions and processes within the studied biological network, emphasizing the strengths and significance of the identified genes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFigure 4.\u003c/strong\u003e Unveiling protein networks in susceptible genes through integrated analysis of Protein-Protein interaction enrichment associations.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1.\u003c/strong\u003e Evaluation metrics of the ML models performance built for rice blast disease resistant versus susceptible gene groups.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003eResults are based on an average of the 3-fold cross validation. The top performing ML model and its metrics are highlighted for comparison. 
SVM denotes Support Vector Machines, NB Naive Bayes, k-NN k-Nearest Neighbour, LR Logistic Regression and RF Random Forest.\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"938\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.099041533546325%\"\u003e\n \u003cp\u003e\u003cstrong\u003eComparison type\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.501597444089457%\"\u003e\n \u003cp\u003e\u003cstrong\u003eML Model\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.608093716719916%\"\u003e\n \u003cp\u003e\u003cstrong\u003eArea Under Curve (AUC)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"15.97444089456869%\"\u003e\n \u003cp\u003e\u003cstrong\u003eClassification Accuracy (CA)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"12.673056443024494%\"\u003e\n \u003cp\u003e\u003cstrong\u003eMatthew\u0026rsquo;s Correlation Coefficient (MCC)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"14.802981895633653%\"\u003e\n \u003cp\u003e\u003cstrong\u003ePrecision\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.608093716719916%\"\u003e\n \u003cp\u003e\u003cstrong\u003eRecall\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.73269435569755%\"\u003e\n \u003cp\u003e\u003cstrong\u003eF1 score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.099041533546325%\" rowspan=\"5\"\u003e\n \u003cp\u003eRice blast disease resistant versus susceptible gene groups\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.501597444089457%\"\u003e\n \u003cp\u003eSVM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.608093716719916%\"\u003e\n \u003cp\u003e0.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"15.97444089456869%\"\u003e\n \u003cp\u003e81.21\u0026plusmn;0.85%\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"12.673056443024494%\"\u003e\n \u003cp\u003e0.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"14.802981895633653%\"\u003e\n \u003cp\u003e0.86\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"11.608093716719916%\"\u003e\n \u003cp\u003e0.80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.73269435569755%\"\u003e\n \u003cp\u003e0.79\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.235294117647058%\"\u003e\n \u003cp\u003eNB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.38235294117647%\"\u003e\n \u003cp\u003e81.66\u0026plusmn;13.12%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"14.583333333333334%\"\u003e\n \u003cp\u003e0.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.034313725490197%\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.049019607843137%\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.235294117647058%\"\u003e\n \u003cp\u003eK-NN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.38235294117647%\"\u003e\n \u003cp\u003e90.55\u0026plusmn;8.20%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"14.583333333333334%\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.034313725490197%\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.049019607843137%\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.235294117647058%\"\u003e\n 
\u003cp\u003eLR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.38235294117647%\"\u003e\n \u003cp\u003e84.44\u0026plusmn;12.27%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"14.583333333333334%\"\u003e\n \u003cp\u003e0.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.034313725490197%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.049019607843137%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.235294117647058%\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.38235294117647%\"\u003e\n \u003cp\u003e84.44\u0026plusmn;4.15%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"14.583333333333334%\"\u003e\n \u003cp\u003e0.69\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"17.034313725490197%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"13.357843137254902%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"10.049019607843137%\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2.\u003c/strong\u003e Functional enrichment analysis results for protein sequences of resistance genes of rice blast disease.\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"933\" style=\"margin-right: calc(43%); width: 57%;\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003e\u003cstrong\u003eCategory\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\" 
colspan=\"2\"\u003e\n \u003cp\u003e\u003cstrong\u003eTerm ID\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.327974276527332%\"\u003e\n \u003cp\u003e\u003cstrong\u003eTerm description\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e\u003cstrong\u003eObserved gene count\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e\u003cstrong\u003eStrength\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e\u003cstrong\u003eFalse discovery rate\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003e\u003cstrong\u003eMatching proteins in the network (labels)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eCompartments\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eGOCC:0042719\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eMitochondrial intermembrane space protein transporter complex\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e1.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e1.51E-06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003eOsI_06343,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eGO Component\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eGO:0005576\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eExtracellular region\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e0.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e0.0469\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003eOsI_06343,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eGO Function\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eGO:0043531\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eADP binding\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e1.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e0.0121\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003ePit,OsI_06343,OsI_35589\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eKEGG\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003emap04626\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003ePlant-pathogen interaction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e0.0035\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003eOsI_06343,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eKEGG\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003emap04075\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003ePlant hormone signal transduction\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e1.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e0.0282\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003ePit,OsI_03989,OsI_16426,OsI_30909\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eGO Process\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eGO:0006952\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eDefense response\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e1.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e8.15E-06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003ePit,OsI_06343,PI21,OsI_22582,Pid3,OsI_35589,OsI_22584,OsI_27974,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eGO Process\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eGO:0050896\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eResponse to stimulus\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e0.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e8.48E-05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003ePit,OsI_00604,OsI_03989,OsI_06343,PI21,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_22584,OsI_27974,OsI_30909,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eReactome\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eMAP-6798695\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eNeutrophil degranulation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e1.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e9.16E-09\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003ePit,OsI_03989,OsI_06343,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_30909,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eReactome\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eMAP-168249\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eInnate Immune System\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd
width=\"8.145766345123258%\"\u003e\n \u003cp\u003e1.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e5.98E-08\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003ePit,OsI_03989,OsI_06343,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_30909,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.07502679528403%\"\u003e\n \u003cp\u003eReactome\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003eMAP-168256\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.292604501607716%\" colspan=\"2\"\u003e\n \u003cp\u003eImmune System\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.110396570203644%\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.145766345123258%\"\u003e\n \u003cp\u003e0.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.217577706323688%\"\u003e\n \u003cp\u003e3.84E-06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"36.01286173633441%\"\u003e\n \u003cp\u003ePit,OsI_03989,OsI_06343,OsI_16426,OsI_22582,Pid3,OsI_35589,OsI_27974,OsI_30909,OsI_37989\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.659873506676036%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd width=\"8.362614195361912%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd width=\"0.07027406886858749%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd width=\"7.730147575544624%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd width=\"5.621925509486999%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd width=\"5.200281096275474%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd width=\"5.551651440618412%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd width=\"59.803232607167956%\"\u003e\u003cbr\u003e\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3.\u003c/strong\u003e Functional enrichment analysis 
results for protein sequences of susceptible genes of rice blast disease.\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\" style=\"margin-right: calc(39%); width: 61%;\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003e\u003cstrong\u003eCategory\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e\u003cstrong\u003eTerm ID\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003e\u003cstrong\u003eTerm description\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e\u003cstrong\u003eObserved gene count\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e\u003cstrong\u003eStrength\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e\u003cstrong\u003eFalse discovery rate\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003e\u003cstrong\u003eMatching proteins in the network (labels)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Component\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0005887\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eIntegral component of plasma membrane\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e1.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e0.0154\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
width=\"34.02061855670103%\"\u003e\n \u003cp\u003eXa25,SWEET14,SWEET11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Function\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0008515\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eSucrose transmembrane transporter activity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e2.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e0.0168\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003eSWEET14,SWEET11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Function\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0051119\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eSugar transmembrane transporter activity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e1.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e0.0168\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003eXa25,SWEET14,SWEET11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Function\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0043565\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eSequence-specific DNA 
binding\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e0.0367\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003eOsI_04690,OsI_04142,OsI_16171,WRKY45-2,OsI_23702\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Process\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0006952\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eDefense response\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e1.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e2.82E-06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003ePI21,OsI_04142,WRKY45-2,OsI_19711,OsI_21727,OsI_23702,OsI_30063,OsI_32658,SWEET11,NPR1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Process\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0006950\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eResponse to stress\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e0.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e5.92E-05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
width=\"34.02061855670103%\"\u003e\n \u003cp\u003eOsI_04690,PI21,OsI_34937,OsI_04142,WRKY45-2,OsI_19711,OsI_21727,OsI_23702,OsI_30063,OsI_32658,SWEET11,NPR1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Process\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0051707\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eResponse to other organism\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e1.19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e0.00021\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003ePI21,WRKY45-2,OsI_19711,OsI_30063,OsI_32658,SWEET11,NPR1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eGO Process\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003eGO:0098542\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eDefense response to other organism\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e1.32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e0.00021\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003ePI21,WRKY45-2,OsI_19711,OsI_30063,SWEET11,NPR1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"10.309278350515465%\"\u003e\n \u003cp\u003eReactome\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n 
\u003cp\u003eMAP-189200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"19.587628865979383%\"\u003e\n \u003cp\u003eCellular hexose transport\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"8.24742268041237%\"\u003e\n \u003cp\u003e2.29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"9.278350515463918%\"\u003e\n \u003cp\u003e0.00049\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"34.02061855670103%\"\u003e\n \u003cp\u003eXa25,SWEET14,SWEET11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Discussion","content":"\u003cp\u003eStudying the response status of rice to blast disease through protein sequence analysis holds great promise for understanding the molecular basis of resistance, developing diagnostic tools, and designing new disease control strategies. Molecular markers are specific DNA sequences or variations associated with a particular trait or phenotype, in this case resistance or susceptibility to rice blast. If particular amino acids or dipeptides are found to form a signature that separates resistance genes from susceptibility genes of rice blast disease, they can also be considered molecular markers. The goal is to develop broadly applicable markers; incorporating data from diverse rice varieties can enhance the generalizability of the identified markers and their potential for application in breeding programs.\u003c/p\u003e \u003cp\u003eLarge-scale production of commercial crops has declined owing to several biotic stress factors, one of which is fungal infection. Early identification of disease resistance predictors plays a major role in crop improvement. Resistance genes, also termed R genes, provide resistance to pathogens through the R proteins they encode. These R proteins play a major role in genetic plant breeding and pathology programs.
The prediction of these genes is important because they trigger the defense system and inhibit the growth of the pathogen [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe present results of the ML predictive models are significant in at least two major respects. First, the ML model built with the k-Nearest Neighbour algorithm achieved a prediction accuracy of 90% and an area under the ROC curve of 0.90 using the important Boruta-ranked variables from rice blast disease resistance genes versus susceptibility genes. This approach may therefore be used in rice breeding programs to predict blast disease resistance or susceptibility status, potentially earlier than is presently possible. The ML models built on the important Boruta-ranked variables could be improved further by experimenting with other feature selection methods. Second, the identified Boruta-ranked variables can aid knowledge discovery and can be studied further in research settings. To our knowledge, this is the first study to use traditional ML algorithms to predict rice blast disease resistance genes versus susceptibility genes from their encoded protein sequences.\u003c/p\u003e \u003cp\u003eThe functional enrichment analysis of the resistance genes of rice blast disease revealed several significant findings. Notably, the mitochondrial intermembrane space protein transporter complex term was enriched with a strength of 1.71 and a low false discovery rate (FDR) of 1.51E-06, and involves six associated genes. Additionally, genes related to the extracellular region show a strength of 0.9 (FDR: 0.0469), underlining their potential roles.
Key GO Functions, such as ADP binding (strength: 1.94, FDR: 0.0121), and KEGG pathways, including plant-pathogen interaction (strength: 0.97, FDR: 0.0035) and plant hormone signal transduction (strength: 1.05, FDR: 0.0282), offer insights into crucial molecular functions. The analysis also sheds light on defense mechanisms, with the GO Process term defense response comprising nine genes (strength: 1.15, FDR: 8.15E-06). Furthermore, the Reactome pathways Neutrophil degranulation, Innate Immune System, and Immune System include 10 genes each, with strengths ranging from 0.94 to 1.26 and low FDRs (3.84E-06 to 9.16E-09). For the susceptibility genes of rice blast disease, the functional enrichment analysis indicated that Xa25, SWEET14, and SWEET11 are associated with the integral component of the plasma membrane, sugar transport, and cellular hexose transport, with strengths ranging from 1.77 to 2.71 and low false discovery rates. Additionally, PI21 and WRKY45-2 are implicated in defense responses, with varying strengths and consistently low FDRs. Overall, these findings highlight key molecular components and functional aspects, providing a foundation for deeper exploration of the system's biology, including energy production, cellular functions, and responses to external threats.\u003c/p\u003e\n\u003ch3\u003eLimitations\u003c/h3\u003e\n\u003cp\u003eThe study presented here is a significant step toward understanding blast disease resistance in rice through the identification of molecular markers using protein sequence-based machine learning. However, several limitations should be considered in interpreting the findings.
Firstly, the study's reliance on available protein sequences of blast disease-resistant and susceptible genes in rice may be constrained by the quality and diversity of these datasets, potentially impacting the generalizability of the identified markers and machine learning model performance. Secondly, the complexity of biological mechanisms underlying blast disease resistance might not be fully captured solely through protein sequences, overlooking other genetic, epigenetic, or environmental factors contributing to resistance. Moreover, while the k-NN machine learning model demonstrated promising performance, its applicability to different rice varieties or varying environmental conditions remains uncertain, necessitating external validation with diverse datasets. Additionally, the functional significance of the 20 identified molecular markers requires further experimental validation to elucidate their biological roles in blast disease resistance. Furthermore, the study might overlook the evolutionary dynamics of the blast pathogen and the potential impacts of genetic variations within resistant and susceptible gene groups, potentially limiting the long-term effectiveness of the identified markers. Acknowledging and addressing these limitations through expanded datasets, diverse model validations, functional studies, and considering broader biological aspects will fortify the reliability and impact of the study's findings in enhancing our understanding of blast disease resistance in rice.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn conclusion, this study represents a pivotal stride towards uncovering molecular markers for blast disease resistance in rice, employing a protein sequence-based machine learning approach. While our findings showcase promising advancements and offer insights into potential markers, several limitations underscore the need for further investigation. 
The reliance on available datasets of blast disease-resistant and susceptible genes in rice poses a challenge in ensuring comprehensive coverage and diversity, potentially impacting the generalizability of the identified markers. Moreover, the intricate biological mechanisms underlying blast disease resistance demand a more holistic exploration beyond protein sequences, considering additional genetic, epigenetic, and environmental factors. Although the k-NN machine learning model exhibited favourable performance, its adaptability to varying rice varieties and environmental conditions warrants scrutiny through robust external validations. Furthermore, the functional characterization of the identified molecular markers remains a critical avenue for elucidating their roles in blast disease resistance. Addressing these limitations through expanded datasets, multidimensional investigations, and functional validations will significantly augment the reliability and applicability of our findings, contributing substantially to the understanding and enhancement of blast disease resistance in rice cultivation.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAuthors Contributions:\u003c/strong\u003e Conceptualization: Angelina Thomas Villikudathil, Radhakrishnan E. 
K; Methodology: Angelina Thomas Villikudathil; Formal analysis and data curation: Angelina Thomas Villikudathil; Supervision: Jayachandran K and Radhakrishnan E K; Writing, review and editing: Angelina Thomas Villikudathil, Jayachandran K and Radhakrishnan E K; Funding acquisition: Radhakrishnan E K and Jayachandran K.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u003c/strong\u003e This project was financially supported by Rashtriya Uchchatar Shiksha Abhiyan (RUSA) 2.0 Major Research Project, SC/ST Cell Number 7134/SC/ST Cell/2023/MGU, Priyadarshini Hills, Dated: 26.06.2023.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests:\u003c/strong\u003e The authors declare that they have no competing interests.\u003cbr\u003e\u003cstrong\u003eAvailability of data and materials:\u003c/strong\u003e All data and materials of this study will be provided on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthical Approval:\u003c/strong\u003e No ethical approval was required for this study.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent to Participate:\u003c/strong\u003e No participants were involved in this study.\u003cbr\u003e\u003cstrong\u003eConsent to Publish:\u003c/strong\u003e All authors have reviewed the final version of the manuscript and consent to its publication.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eNing X, Yunyu W, Aihong L (2020) Strategy for Use of Rice Blast Resistance Genes in Rice Molecular Breeding\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAsibi AE, Chai Q, Coulter JA (2019) Rice blast: A disease with implications for global food security\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGavhane DB, Kulwal PL, Kumbhar SD, Jadhav AS, Sarawate CD (2019) Cataloguing of blast resistance genes in landraces and breeding lines of rice from India. J Genet 98.
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s12041-019-1148-4\u003c/span\u003e\u003cspan address=\"10.1007/s12041-019-1148-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSekhwal MK, Li P, Lam I, Wang X, Cloutier S, You FM (2015) Disease resistance gene analogs (RGAs) in plants\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYadav MK, Aravindan S, Ngangkham U, Raghu S, Prabhukarthikeyan SR, Keerthana U, Marndi BC, Adak T, Munda S, Deshmukh R, Pramesh D, Samantaray S, Rath PC (2019) Blast resistance in Indian rice landraces: Genetic dissection by gene specific markers. PLoS ONE 14. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pone.0211061\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0211061\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShikari AB, Rajashekara H, Khanna A, Gopala Krishnan S, Rathour R, Singh UD, Sharma TR, Prabhu KV, Singh AK (2014) Identification and validation of rice blast resistance genes in Indian rice germplasm. Indian J Genet Plant Breed 74:286\u0026ndash;299. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5958/0975-6906.2014.00846.3\u003c/span\u003e\u003cspan address=\"10.5958/0975-6906.2014.00846.3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLv Z, Jin S, Ding H, Zou Q (2019) A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features. Front Bioeng Biotechnol 7. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3389/fbioe.2019.00215\u003c/span\u003e\u003cspan address=\"10.3389/fbioe.2019.00215\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXia J, Hu X, Shi F, Niu X, Zhang C (2010) Support vector machine method on predicting resistance gene against Xanthomonas oryzae pv. oryzae in rice. Expert Syst Appl 37:5946\u0026ndash;5950. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.eswa.2010.02.010\u003c/span\u003e\u003cspan address=\"10.1016/j.eswa.2010.02.010\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLobiyal Durga DK, Mohapatra P, Nagar A, Sahoo MN Proceedings of the International Conference on Signal, Networks, Computing, and Systems. Springer\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaundal R, Raghava GPS (2009) RSLpred: An integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information. Proteomics 9:2324\u0026ndash;2342. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/pmic.200700597\u003c/span\u003e\u003cspan address=\"10.1002/pmic.200700597\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaundal R, Sahu SS, Verma R, Weirick T (2013) Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning. BMC Bioinformatics 14. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/1471-2105-14-S14-S7\u003c/span\u003e\u003cspan address=\"10.1186/1471-2105-14-S14-S7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWan X, Tan X (2019) A study on separation of the protein structural types in amino acid sequence feature spaces. PLoS ONE 14. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pone.0226768\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0226768\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePrasannath K (2017) Plant defense-related enzymes against pathogens: a review. AGRIEAST: J Agricultural Sci 11:38. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.4038/agrieast.v11i1.33\u003c/span\u003e\u003cspan address=\"10.4038/agrieast.v11i1.33\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar J, Ramlal A, Kumar K, Rani A, Mishra V (2021) Signaling pathways and downstream effectors of host innate immunity in plants\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePal T, Jaiswal V, Chauhan RS (2016) DRPPP: A machine learning based tool for prediction of disease resistance proteins in plants. Comput Biol Med 78:42\u0026ndash;48. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.compbiomed.2016.09.008\u003c/span\u003e\u003cspan address=\"10.1016/j.compbiomed.2016.09.008\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSaragih GS, Rustam Z (2018) Support Vector Machine with Fisher Score Feature Selection to Predict Disease-Resistant Gene in Rice. In: Journal of Physics: Conference Series. Institute of Physics Publishing\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaundal R, Kapoor AA, Raghava GPS (2006) Machine learning techniques in disease forecasting: A case study on rice blast prediction. BMC Bioinformatics 7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/1471-2105-7-485\u003c/span\u003e\u003cspan address=\"10.1186/1471-2105-7-485\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShaik R, Ramakrishna W (2014) Machine learning approaches distinguish multiple stress conditions using stress-responsive genes and identify candidate genes for broad resistance in rice. Plant Physiol 164:481\u0026ndash;495. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1104/pp.113.225862\u003c/span\u003e\u003cspan address=\"10.1104/pp.113.225862\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDaniya T, Vigneshwari DS, Scholar R (2019) A Review on Machine Learning Techniques for Rice Plant Disease Detection in Agricultural Research. Int J Adv Sci Technol 28:49\u0026ndash;62\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRamesh S, Vydeki D (2019) Application of machine learning in detection of blast disease in south indian rice crops. 
J Phytology 11:31\u0026ndash;37. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.25081/jp.2019.v11.5476\u003c/span\u003e\u003cspan address=\"10.25081/jp.2019.v11.5476\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNettleton DF, Katsantonis D, Kalaitzidis A, Sarafijanovic-Djukic N, Puigdollers P, Confalonieri R (2019) Predicting rice blast disease: Machine learning versus process-based models. BMC Bioinformatics 20. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s12859-019-3065-1\u003c/span\u003e\u003cspan address=\"10.1186/s12859-019-3065-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHsieh J-Y, Huang W, Yang H-T, Lin C-C, Fan Y-C, Chen H (2019) Building the Rice Blast Disease Prediction Model based on Machine Learning and Neural Networks\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXiao N, Cao DS, Zhu MF, Xu QS (2015) Protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics. Oxford University Press, pp 1857\u0026ndash;1859\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKursa MB, Rudnicki WR (2010) Feature selection with the Boruta package. J Stat Softw 36. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.18637/jss.v036.i11\u003c/span\u003e\u003cspan address=\"10.18637/jss.v036.i11\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRainey C, Villikudathil AT, McConnell J, Hughes C, Bond R, McFadden S (2023) An experimental machine learning study investigating the decision-making process of students and qualified radiographers when interpreting radiographic images. PLOS Digit Health 2:e0000229. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pdig.0000229\u003c/span\u003e\u003cspan address=\"10.1371/journal.pdig.0000229\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKluyver T, Ragan-Kelley B, P\u0026eacute;rez F, Granger B, Bussonnier M, Frederic J, Kelley K, Hamrick J, Grout J, Corlay S, Ivanov P, Avila D, Abdalla S, Willing C (2016) Jupyter Notebooks-a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas - Proceedings of the 20th International Conference on Electronic Publishing, ELPUB 2016. 87\u0026ndash;90 \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3233/978-1-61499-649-1-87\u003c/span\u003e\u003cspan address=\"10.3233/978-1-61499-649-1-87\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMenczer F, Fortunato S, Davis CA (2020) Python Tutorial. A First Course in Network Science. 221\u0026ndash;237. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1017/9781108653947.010\u003c/span\u003e\u003cspan address=\"10.1017/9781108653947.010\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBraun WJ, Murdoch DJ (2007) A First Course in Statistical Programming with R. Cambridge University Press\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"journal-of-proteins-and-proteomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Journal of Proteins and Proteomics](https://www.springer.com/journal/42485)","snPcode":"42485","submissionUrl":"https://submission.nature.com/new-submission/42485/3","title":"Journal of Proteins and Proteomics","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Oryza sativa, Sequence, Amino acid composition, Dipeptide composition, Machine Learning, Boruta","lastPublishedDoi":"10.21203/rs.3.rs-4148015/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4148015/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eRice blast disease, caused by the fungal pathogen Magnaporthe oryzae, poses a severe threat to global rice cultivation, impacting over 
3.5\u0026nbsp;billion people and the livelihoods of 200\u0026nbsp;million. Despite challenges in achieving sustainable resistance, our study focuses on identifying proteomic signatures in blast disease-resistant and susceptible genes using amino acid and dipeptide compositions. Leveraging machine learning, particularly a k-NN model, we identified 20 molecular markers distinguishing between resistant and susceptible genes with 90% accuracy. This research highlights the potential of protein sequence-based machine learning for predicting blast disease resistance, providing valuable insights for disease-resistant breeding programs and enhancing global food security through sustainable rice cultivation.\u003c/p\u003e","manuscriptTitle":"k-Nearest Neighbour machine method for predicting resistance gene against Magnaporthe oryzae in rice using proteomic markers","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-26 19:45:37","doi":"10.21203/rs.3.rs-4148015/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-06-13T08:49:11+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-06-10T14:11:56+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"211007342355872405486036248464405131159","date":"2024-05-29T07:58:35+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"70112668788325724408207375888160990268","date":"2024-05-22T06:03:55+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-05-20T05:39:49+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-03-31T12:18:52+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-03-22T12:53:38+00:00","index":"","fulltext":""},{"type":"submitted","content":"Journal of Proteins and 
Proteomics","date":"2024-03-22T07:56:01+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"journal-of-proteins-and-proteomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Journal of Proteins and Proteomics](https://www.springer.com/journal/42485)","snPcode":"42485","submissionUrl":"https://submission.nature.com/new-submission/42485/3","title":"Journal of Proteins and Proteomics","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"49e24d15-b47a-444d-b8b5-21e1a248b7d0","owner":[],"postedDate":"March 26th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-12-10T17:43:49+00:00","versionOfRecord":{"articleIdentity":"rs-4148015","link":"https://doi.org/10.1007/s42485-024-00159-3","journal":{"identity":"journal-of-proteins-and-proteomics","isVorOnly":false,"title":"Journal of Proteins and Proteomics"},"publishedOn":"2024-07-29 00:00:00","publishedOnDateReadable":"July 29th, 2024"},"versionCreatedAt":"2024-03-26 19:45:37","video":"","vorDoi":"10.1007/s42485-024-00159-3","vorDoiUrl":"https://doi.org/10.1007/s42485-024-00159-3","workflowStages":[]},"version":"v1","identity":"rs-4148015","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4148015","identity":"rs-4148015","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Extraction quality varies by source: PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00