Cluster-Based Data Balancing for Protein Contact Map Prediction Using MLP Neural Networks

doi:10.21203/rs.3.rs-7528788/v1


2025 · doi:10.21203/rs.3.rs-7528788/v1

preprint OA: closed



Method Article

Samaneh Saghafi, Maryam Saghafi

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7528788/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Class imbalance poses a significant challenge in binary classification tasks, particularly in bioinformatics applications such as protein contact map prediction. This study addresses the tendency of conventional classifiers to misclassify minority class instances by proposing a novel, intelligent data balancing method. The primary objective was to enhance prediction accuracy and sensitivity in protein contact map classification by employing a cluster-based undersampling technique integrated with a multilayer perceptron (MLP) neural network.
Protein sequence data were extracted from established bioinformatics databases, and the class imbalance was mitigated using a k-means clustering algorithm. The majority class samples were divided into clusters, and representative instances were selected based on their proximity to cluster boundaries and intra-cluster density. The resulting dataset, balanced at a 4:1 ratio, was used to train and evaluate an MLP classifier optimized with the Levenberg–Marquardt algorithm. Performance metrics such as accuracy, sensitivity, precision, and F1-score were analyzed across multiple experimental groups. The proposed method outperformed conventional random undersampling, achieving an average accuracy of 93% and sensitivity of 61%, compared to 90% and 50% respectively for the baseline approach. These results demonstrate that cluster-aware sampling significantly improves classification performance in imbalanced datasets. This approach offers a practical and effective strategy for enhancing predictive models in bioinformatics. Future research may explore its integration with deep learning architectures and extension to multi-class prediction problems. 
Keywords: protein contact map prediction; imbalanced data classification; cluster-based undersampling; k-means clustering; multilayer perceptron; data balancing techniques; bioinformatics classification

Figure 1. Comparison of TPR rates after applying three data balancing methods to each group.

Introduction

Predicting protein structure remains a major challenge in bioinformatics because of its consequences for understanding cellular mechanisms, advancing drug development, and explaining disease dynamics (Jumper et al., 2021) [1]. An essential description of protein structure is the contact map, which records the spatial relationships between amino acid residues and serves as a basis for inferring higher-order structure (Shackelford & Karplus, 2007) [2]. Accurate prediction of contact maps can strengthen the computational models used in structural biology and proteomics. A persistent obstacle is data imbalance: non-contact (negative) residue pairs greatly outnumber contact (positive) residue pairs in protein datasets (Habibi et al., 2013) [3]. Standard machine learning classifiers therefore tend to favor the majority class, which lowers prediction accuracy for the minority class that matters most in practice (Buda et al., 2018) [4]. Addressing this imbalance is thus necessary for reliable contact prediction. This study focuses on data balancing, and in particular on sampling methods designed to alleviate class imbalance in supervised learning.
Conventional strategies such as random undersampling and the Synthetic Minority Oversampling Technique (SMOTE) are widely used, but they risk discarding informative samples or introducing noise (Chawla et al., 2002; Batista et al., 2004) [5, 6]. To reduce these risks, recent studies have used clustering algorithms to better preserve the distribution and representativeness of a dataset. In particular, k-means clustering has been proposed as a way to stratify majority class samples into meaningful subgroups based on density and proximity to decision boundaries (Bholowalia & Kumar, 2014) [7]. The present study extends this line of work by proposing a cluster-based undersampling technique that selects majority class samples according to intra-cluster attributes. The approach aims to improve data representativeness while reducing classifier bias, and thereby to improve overall model performance (Lemaître et al., 2017) [8]. The dependent variable in this study is classification performance in protein contact map prediction, measured by accuracy, sensitivity (true positive rate), precision, and F1-score.
Together, these metrics show how well the model recognizes true contact pairs despite the dominance of non-contact pairs (Eickholt & Cheng, 2015) [9]. Sensitivity is especially important in bioinformatics, since false negatives can lead to misinterpreted protein structures (Chicco & Jurman, 2020; Saito & Rehmsmeier, 2015) [10, 11]. Earlier methods based on deep learning and ensembles, including PSICOV, MetaPSICOV, and DNcon, have shown encouraging results, but they often depend on well-balanced datasets or on subsequent filtering (Jones et al., 2015; Wang et al., 2017) [12, 13]. Careful data balancing is therefore an important factor in obtaining consistent classification results (Buda et al., 2018) [4]. This study aims to provide empirical evidence that neural network methods can improve contact prediction when an MLP classifier is trained on carefully selected datasets (Goodfellow, Bengio, & Courville, 2016; Zhao & Shrivastava, 2013; Zhang et al., 2014) [14, 15, 16].

There has also been little systematic study of how selecting samples by their distance to cluster boundaries and by density affects model sensitivity and accuracy, both of which are key to biological interpretation (Barua et al., 2012) [17]. This article addresses that gap and proposes a systematic, two-tiered sampling framework whose selection criteria are informed by cluster characteristics (Lemaître et al., 2017) [8]. Rather than relying on randomness or synthetic samples, the technique draws on the structure of the data itself (López et al., 2013) [18]. Because such a technique has not previously been applied and validated for this task, further empirical study is needed, and this work undertakes it.

This study examines whether a cluster-based data balancing technique can improve the performance of multilayer perceptron (MLP) classifiers in protein contact map prediction. Specifically, it examines how selecting majority class samples by proximity to cluster boundaries and by intra-cluster density affects classification metrics, including sensitivity and F1-score. Two research questions guide the work: (1) Does the proposed sampling technique improve classification sensitivity compared with conventional random undersampling? (2) How does cluster-informed selection affect overall predictive accuracy and balance?
To answer these questions, the study uses a comparative experimental design, applying protein sequence data and assessing model performance under different sampling conditions. The manuscript is organized as follows: the Methods section describes the experimental design, datasets, and procedures; the Results section reports the empirical findings; the Discussion interprets these results in light of prior work and the study's limitations; and the Conclusion summarizes the main contributions and suggests directions for future work.

Materials and Methods

This study used a structured, quantitative, controlled experimental design to examine the effect of a cluster-based undersampling approach on protein contact map classification. The work was carried out computationally and drew on machine learning and bioinformatics, with an emphasis on supervised classification using neural networks. Multilayer perceptron (MLP) neural networks were trained on datasets with different balance characteristics derived from protein sequence data. The principal aim was to determine whether a sampling strategy based on cluster attributes could improve sensitivity and precision in a binary classification problem with significant class imbalance. Data preprocessing, model training, and validation followed established quantitative methods and statistical performance indicators. The datasets comprised protein residue pair information from established bioinformatics repositories, notably the Protein Data Bank (PDB), together with sequence files in FASTA format.
The data described spatial features of amino acid residues in two classes: positive samples (contact residues) and negative samples (non-contact residues). No human or animal participants were involved. The dataset was inherently imbalanced, with a large excess of negative examples. Purposive sampling was applied to the majority class to reduce its overrepresentation. The number of clusters k for k-means was chosen with the Elbow method; the majority class samples were then clustered, and samples were selected according to intra-cluster density and proximity to cluster boundaries. A ratio of 4:1 between negative and positive samples was maintained across all experimental groups so that the conditions were comparable.

The study used MATLAB for neural network modeling and data analysis, and k-means clustering for data stratification. The MLP was implemented with MATLAB’s Neural Network Toolbox and trained with the Levenberg–Marquardt algorithm, a widely used training method known for fast convergence and accuracy in function approximation tasks. Custom scripts implemented the sampling procedure, which relies on measuring distances between samples across the chosen clusters. Data from the PDB and FASTA files were parsed and preprocessed with standard bioinformatics protocols, and features such as position-specific scoring matrices (PSSMs) were extracted using BLAST and HSSP.
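As a hedged illustration of the Elbow step described above, the sketch below runs a plain k-means for several values of k and picks the k at which the inertia curve bends most sharply. It is written in Python rather than the MATLAB used in the study. The function names (`kmeans`, `elbow_k`), the farthest-point seeding, and the second-difference rule for locating the elbow are all assumptions made for this example; the source does not state how the elbow was read off.

```python
import math


def kmeans(points, k, iters=100):
    """Lloyd's k-means with deterministic farthest-point seeding (an assumption)."""
    # Seed with the first point, then repeatedly add the point farthest
    # from all seeds chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia


def elbow_k(inertias):
    """inertias[i] is the inertia for k = i + 1; return the k with the sharpest bend."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        # Second difference: how much the drop in inertia slows down at this k.
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k
```

In practice the inertia curve is often inspected visually; the second-difference rule is only one simple way to automate that reading, and it can only return values of k strictly between the smallest and largest k tried.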
No laboratory equipment was required, since all evaluations were computational. Data preparation proceeded in several phases, beginning with cleaning the raw protein sequence files and extracting features from them. Features derived from position-specific scoring matrices (PSSM) and Homology-derived Secondary Structure of Proteins (HSSP) were combined to form the input vector for each residue pair. Majority class samples were clustered with k-means, and within each cluster samples were selected according to density and distance from the cluster boundary. The resulting datasets were used to train MLP networks with different architectures, and the models were evaluated with standard 10-fold cross-validation. Performance was assessed with accuracy, positive predictive value (PPV), true positive rate (TPR), and F1-score. Ethics approval was not required because the study used publicly available biological datasets and involved no human or animal experimentation.

Together, the experimental design, sample selection, computational methods, and protocols form a coherent framework for addressing the research problem. Combining clustering with supervised learning offers a way to manage the data imbalance that is common in bioinformatics classification. Consistent sampling ratios, standard performance metrics, and cross-validation support the rigor and reproducibility of the study.
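The selection step described above can be sketched as follows. This is one possible reading, not the authors' implementation: the source does not define how density or boundary distance is computed, so the sketch assumes that density is the number of cluster members within a fixed `radius`, that distance from the centroid stands in for proximity to the cluster boundary, and that the two are combined by summing ranks. The function names, the `radius` parameter, and the size-proportional quota are hypothetical.

```python
import math


def _ranks(values):
    """Rank positions (0 = smallest); ties are broken by original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def select_from_cluster(members, centroid, n_keep, mode="MDMD", radius=1.0):
    """Keep n_keep members of one majority-class cluster."""
    # Density (assumed definition): number of other members within `radius`.
    density = [
        sum(1 for j, q in enumerate(members) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(members)
    ]
    # Distance (assumed proxy for boundary proximity): distance to the centroid.
    distance = [math.dist(p, centroid) for p in members]
    score = [d + s for d, s in zip(_ranks(density), _ranks(distance))]
    # MDMD prefers high density and high distance; LDLD prefers low values of both.
    order = sorted(range(len(members)), key=lambda i: score[i], reverse=(mode == "MDMD"))
    return [members[i] for i in order[:n_keep]]


def cluster_undersample(clusters, centroids, n_positive, ratio=4, mode="MDMD", radius=1.0):
    """Keep roughly ratio * n_positive negatives, shared across clusters by size."""
    target = ratio * n_positive
    total = sum(len(members) for members in clusters)
    kept = []
    for members, centroid in zip(clusters, centroids):
        # Rounding means the final count can differ slightly from the target.
        quota = min(len(members), round(target * len(members) / total))
        kept.extend(select_from_cluster(members, centroid, quota, mode, radius))
    return kept
```

The default `ratio=4` mirrors the 4:1 negative-to-positive ratio reported in the text; other definitions of density or boundary distance would slot into `select_from_cluster` without changing the surrounding logic.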
All procedures were designed to assess whether cluster-informed data balancing improves the predictive accuracy of protein contact maps, and to provide a reproducible framework for future research.

Results

The proposed cluster-based data balancing method improved the performance of the MLP classifier in protein contact map prediction. Across experimental groups, sensitivity and F1-score improved consistently when majority class samples were selected according to intra-cluster density and proximity to cluster boundaries. The method outperformed conventional random undersampling, which removed 20% of negative instances without regard to the distribution of the samples. Sampling that combined k-means clustering with density-based selection gave more robust and precise classification. These patterns held across the neural network architectures and undersampling conditions tested.

For the first research question, whether the proposed cluster-based undersampling improves classification sensitivity compared with random sampling, the results showed a clear improvement. Mean sensitivity (true positive rate) was 50.02% for the control method (random 20% undersampling), 56.59% for the cluster-based LDLD (Low Density–Low Distance) method, and 61.70% for the MDMD (More Density–More Distance) method, the highest of the three. These improvements were observed across all seven experimental groups.
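To make the size of these differences concrete, the short sketch below computes the absolute and relative gains from the mean sensitivities quoted above. The helper name `sensitivity_gain` is introduced only for this illustration.

```python
def sensitivity_gain(baseline, value):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Mean sensitivities (%) as reported in the text.
baseline = 50.02
reported = {"LDLD": 56.59, "MDMD": 61.70}

for name, value in reported.items():
    absolute, relative = sensitivity_gain(baseline, value)
    print(f"{name}: +{absolute:.2f} points, +{relative:.1f}% relative to baseline")
```

The gains come to about 6.57 percentage points for LDLD and 11.68 for MDMD, or roughly 13% and 23% relative to the random-undersampling baseline.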
For instance, in Group 4 of the MDMD configuration, sensitivity reached its highest individual value, 63.55%, as reported in Table 1. These gains underscore the value of using density and boundary information during sample selection.

Table 1. Results of the MLP neural network using (a) 20% undersampling, (b) LDLD (Low Density–Low Distance), and (c) MDMD (More Density–More Distance).

Method               Sensitivity   Accuracy   Specificity
20% Undersampling    0.5002        0.9626     0.9047
MDMD                 0.6170        0.9104     0.9327
LDLD                 0.5659        0.9042     0.9331

For the second research question, concerning the effect of the proposed sampling method on overall classification accuracy and F1-score, the results also showed measurable improvements. The baseline technique gave an average accuracy of 90.47% and an F1-score of 0.5002. The LDLD method gave an average accuracy of 93.31% and an F1-score of 0.5659, and the MDMD method an average accuracy of 93.27% and an F1-score of 0.6170. The advantage held for most groups, indicating that cluster-based selection not only raised sensitivity but also improved the balance between precision and recall. The performance of the MLP classifiers trained on these datasets is summarized in Table 1 and illustrated in Fig. 1, which shows the trends of sensitivity and F1-score across groups.

The Elbow method was also effective in determining the number of clusters for k-means. Comparison of clustering configurations showed that selection based on both density and boundary distance gave better results than sampling based on a single criterion or on random choice.
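For reference, the metrics discussed in this section follow from the four confusion-matrix counts as sketched below. The counts in the example are hypothetical, chosen only to reflect a 4:1 negative-to-positive ratio; they are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary metrics from confusion-matrix counts."""

    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty denominators

    sensitivity = ratio(tp, tp + fn)   # true positive rate (TPR, recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)     # positive predictive value (PPV)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts: 100 positives and 400 negatives (not from the study).
metrics = classification_metrics(tp=60, fp=20, tn=380, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, accuracy can stay high while sensitivity is modest, which is why both are reported. Because F1 is the harmonic mean of precision and sensitivity, it equals sensitivity only when the two coincide.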
Moreover, across all experimental groups the MDMD configuration gave the highest sensitivity and F1-score, suggesting that the combination of sample density and boundary proximity was more effective than either factor alone. The cluster-based methods also gave more consistent classification performance across threshold values and numbers of MLP neurons. No anomalies or performance degradation were observed, and the findings supported the proposed sampling strategy across network and dataset configurations.

Discussion

The results provide evidence that cluster-based undersampling improves classification on imbalanced datasets, particularly in protein contact map prediction. By selecting majority class samples according to intra-cluster density and proximity to cluster boundaries, the model achieved higher sensitivity and F1-scores, indicating a more balanced and accurate classification of contact and non-contact residue pairs.
These observations support the hypothesis that incorporating structural properties of the dataset into sampling can reduce the biases associated with random undersampling (Batista et al., 2004) [6]. The sampling framework described in this article is of theoretical interest because it uses intrinsic properties of the data to guide preprocessing, consistent with the broader trend in bioinformatics toward biologically relevant computational models (Ma et al., 2015) [19]. Higher sensitivity in identifying residue interactions improves the accuracy of downstream structure prediction, which benefits work on protein folding and drug development.

The results agree with and extend earlier work on data imbalance and its effect on classification. Standard approaches, including SMOTE and cost-sensitive learning, address class imbalance to varying degrees, but they either generate synthetic data or rely on externally specified costs that can be hard to calibrate. The present method differs in that it uses structural properties of the dataset, specifically cluster density and distance measurements, so that the original data distribution is preserved. This distinguishes it from earlier clustering-based methods such as boundary sampling or rough-fuzzy classification, which focused mainly on geometric or probabilistic boundaries without considering density (Peng et al., 2014; Mazumder et al., 2015) [20, 21]. The method builds on the work of Jones et al. (2015) [12] and Eickholt and Cheng (2015) [9], whose ensemble and deep learning methods benefited from balanced input data but did not include systematic strategies for addressing data imbalance before training. The present results therefore advance pre-classification data balancing for protein structure prediction.

Despite these favorable results, the study has several limitations. The dataset was built for binary classification of residue interactions, which limits its relevance to multi-class or multi-label bioinformatics problems. Although the Elbow method was effective in choosing the number of clusters, the choice may be sensitive to variability in the data, which could hinder reproducibility across protein datasets or structural configurations. The method was not evaluated with machine learning models other than the MLP, leaving open its compatibility with deeper architectures such as convolutional neural networks (CNNs) or transformers. Future work should explore these extensions, assess the robustness of the clustering approach on a wider range of biological datasets, and examine adaptive cluster selection strategies that adjust sampling to the complexity of the dataset. Such developments could broaden the applicability of the technique in structural biology and systems bioinformatics.
Conclusion

This study assessed a cluster-based data balancing technique for improving classification in protein contact map prediction. Majority class undersampling guided by intra-cluster density and proximity to cluster boundaries improved sensitivity, accuracy, and F1-score compared with conventional random sampling. In answer to the central research question, the proposed LDLD and MDMD methods produced a more balanced and representative training set for the MLP classifier and improved its ability to identify minority class instances. The results support the hypothesis that data balancing based on the structural characteristics of the dataset can improve classification in imbalanced bioinformatics problems.

The findings have implications for both theory and practice in machine learning and bioinformatics. The method offers an adaptable framework for preprocessing imbalanced datasets in tasks where identifying the minority class is critical, such as disease marker prediction, gene function annotation, and protein interaction analysis. Because it avoids generating synthetic data and relies on inherent characteristics of the data, it is consistent with good practice for biologically relevant machine learning.
Future work may combine the sampling framework with deep learning architectures, including convolutional and transformer-based models, evaluate its robustness in multi-class settings, and extend it to other fields where class imbalance is a notable obstacle. The results contribute to the continued development of preprocessing methods that improve model fairness, accuracy, and applicability in empirical science.

Declarations

Author Contribution

Contributed to coding and implementation of the programs

References

1. Jumper J et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9.
2. Shackelford G, Karplus K. Contact prediction using mutual information and neural nets. Proteins Struct Funct Bioinform. 2007;69:159–64.
3. Habibi N, Saraee M, Korbekandi H. Protein contact map prediction using committee machine approach. Int J Data Min Bioinform. 2013;7(4):397–415.
4. Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018;106:249–59.
5. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
6. Batista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsl. 2004;6(1):20–9.
7. Bholowalia P, Kumar A. EBK-means: A clustering technique based on elbow method and k-means in WSN. Int J Comput Appl. 2014;105(9).
8. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res. 2017;18(17):1–5.
9. Eickholt J, Cheng J. A study and benchmark of DNcon: a method for protein residue-residue contact prediction using deep networks. BMC Bioinformatics. 2013;14:S12.
10. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.
11. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432.
12. Jones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics. 2015;31(7):999–1006.
13. Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol. 2017;13(1):e1005324.
14. Goodfellow I, Bengio Y, Courville A. Deep learning. Volume 2. Cambridge: MIT Press; 2016.
15. Zhao Y, Shrivastava AK. Combating sub-clusters effect in imbalanced classification. In: 2013 IEEE 13th International Conference on Data Mining. IEEE; 2013. p. 1295–300.
16. Zhang Y, Fu P, Liu W, Chen G. Imbalanced data classification based on scaling kernel-based support vector machine. Neural Comput Appl. 2014;25(3):927–35.
17. Barua S, Islam MM, Yao X, Murase K. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng. 2012;26(2):405–25.
18. López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci. 2013;250:113–41.
19. Ma J, Wang S, Wang Z, Xu J. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics. 2015;31(21):3506–13.
20. Peng L, Xiao-yang Y, Ting-ting B, Jiu-ling H. Imbalanced data SVM classification method based on cluster boundary sampling and DT-KNN pruning. Int J Signal Process Image Process Pattern Recognit. 2014;7(2):61–8.
21. Mazumder RU, Begum SA, Biswas D. Rough fuzzy classification for class imbalanced data. In: Proceedings of Fourth International Conference on Soft Computing for Problem Solving: SocProS 2014, Volume 1. Springer; 2014. p. 159–71.

Additional Declarations

No competing interests reported.
Nevertheless, a persistent impediment in this domain is the phenomenon of data imbalance, particularly the disproportionate occurrence of non-contact (negative) residue pairs in contrast to contact (positive) residue pairs within protein datasets (Habibi et al., 2013).[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] This disparity leads to standard machine learning classifiers showing reduced effectiveness, as they often lean towards the dominant class, thus resulting in less than ideal prediction accuracy for the lesser class, which is vital in practical situations (Buda et al, 2018).[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] Consequently, addressing this imbalance is imperative for improving the reliability of models in the prediction of protein contacts.\u003c/p\u003e\u003cp\u003eThis inquiry primarily highlights the notion of data balance, especially through employing complex sampling methods that aim to alleviate class imbalance in supervised learning scenarios. Conventional strategies, including random undersampling and the Synthetic Minority Oversampling Technique (SMOTE), have been extensively implemented; however, they frequently encounter limitations such as the potential for information loss or the inadvertent incorporation of noise (Chawla et al., 2002; Batista et al., 2004).[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] To mitigate these restrictions, current studies have analyzed the implementation of clustering algorithms for improved maintenance of the distributional integrity and representativeness of datasets. 
In particular, the k-means clustering algorithm has been endorsed as a methodology for stratifying majority class samples into more substantively pertinent subgroups based on density and their spatial proximity to decision boundaries (Bholowalia \u0026amp; Kumar, 2014).[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e] The current investigation expands upon this research trajectory by proposing an innovative cluster-based undersampling technique that judiciously identifies majority class samples in accordance with intra-cluster attributes. This methodology augments data representativeness while concurrently mitigating classifier bias, thereby facilitating an enhancement in the overall efficacy of the model (Lema\u0026icirc;tre et al., 2017).[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]\u003c/p\u003e\u003cp\u003eThe variable subject to dependence in the present investigation pertains to the efficacy of classification in the context of predicting protein contact maps, which is explicitly quantified utilizing metrics including accuracy, sensitivity (true positive rate), precision, and the F1-score. 
This set of evaluative standards offers a detailed review of the model's skill in accurately recognizing authentic contact pairs while lessening the effect of the dominant non-contact pairs (Eickholt \u0026amp; Cheng, 2015).[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] The significance of sensitivity in bioinformatics frameworks is vital, since false negatives can cause misinterpretations of protein configurations (Chicco \u0026amp; Jurman, 2020; Saito \u0026amp; Rehmsmeier, 2015).[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] Previous investigations utilizing deep learning algorithms and ensemble methodologies, including PSICOV, MetaPSICOV, and DNcon, have demonstrated encouraging outcomes; however, they frequently depend on well-balanced datasets or subsequent filtering techniques (Jones et al., 2015; Wang et al., 2017).[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] Hence, the impact of elaborate data balancing techniques is a significant influence in realizing consistent classification results (Buda et al, 2018).[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] This research is dedicated to presenting empirical evidence that supports the effectiveness of neural network-driven methods in refining contact prediction efficiency, using a classifier (MLP) trained on carefully chosen datasets (Goodfellow, Bengio, \u0026amp; Courville, 2016).[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]\u003c/p\u003e\u003cp\u003eConsequently, the significance of advanced data balancing practices is a vital component in realising uniform classification results. 
This scholarly investigation aims to provide empirical substantiation that advocates for the efficacy of neural network-based approaches in enhancing the efficiency of contact prediction, utilizing a classifier (MLP) that is trained on meticulously curated datasets (Zhao \u0026amp; Shrivastava, 2013; Zhang et al., 2014).[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] Furthermore, there is a deficiency of systematic exploration into how the selection of samples, based on their distance to cluster boundaries and density, influences model sensitivity and accuracy\u0026mdash;key components for biological interpretation (Barua et al., 2012).[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] The dissertation in this article delineates this deficiency and suggests a systematic, two-tiered sampling framework that emphasizes selection criteria informed by cluster characteristics (Lema\u0026icirc;tre et al., 2017).[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] This technique exceeds mere unpredictability or artificially generated information, instead emphasizing the underlying structural traits of the data itself (L\u0026oacute;pez et al., 2013).[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] The absence of prior application and validation of such a technique highlights the imperative for additional empirical exploration, which this research endeavors to accomplish.\u003c/p\u003e\u003cp\u003eThis inquiry seeks to analyze how a cluster-oriented data balancing technique could potentially boost the success of Multilayer Perceptron (MLP) classifiers in the realm of protein contact map prediction. 
In particular, the research scrutinizes the implications of majority class sample selection\u0026mdash;predicated on proximity to cluster boundaries and intra-cluster density\u0026mdash;on classification performance metrics, including sensitivity and F1-score. The research inquiries that direct the investigation encompass: (1) Does the suggested sampling technique enhance classification sensitivity in comparison to conventional random undersampling? (2) In what manner does cluster-informed selection influence overall predictive accuracy and equilibrium? To address these inquiries, the study utilizes a comparative experimental framework, employing protein sequence data and assessing model efficacy under varying sampling scenarios.\u003c/p\u003e\u003cp\u003eThe manuscript is organized as follows: the Methods segment elaborates on the experimental design, datasets, and methodologies employed; the Results segment articulates the empirical findings; the Discussion offers an interpretation of these results in the context of prevailing scholarship and inherent limitations; and the Conclusion encapsulates the principal contributions while suggesting avenues for subsequent inquiry.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003eThis inquiry utilized a structured quantitative and controlled experimental technique to investigate the repercussions of a cluster-based undersampling approach on the efficiency of classifying protein contact maps. Executed within a computational research environment, the study was informed by theoretical frameworks from the fields of machine learning and bioinformatics, with a specific emphasis on supervised classification employing neural networks. This research explored the efficiency of multilayer perceptron (MLP) neural networks that were educated on datasets with diverse balance characteristics, sourced from data regarding protein sequences. 
The principal aim was to determine whether an intelligent sampling strategy, predicated on cluster attributes, could improve sensitivity and precision in a binary classification challenge marked by significant class imbalance. The processes of data preprocessing, model training, and validation were executed in accordance with recognized quantitative methodologies and statistical performance indicators.\u003c/p\u003e\u003cp\u003eThe data sets comprised protein residue pair information sourced from reputable bioinformatics repositories, notably the Protein Data Bank (PDB) and files formatted in FASTA. The collections featured spatial traits of amino acid residues, arranged into two separate categories: positive samples (signifying contact residues) and negative samples (signifying non-contact residues). There were no human or animal participants involved in this study. The dataset presented a core imbalance, featured by a notably excessive rate of negative examples. A purposive sampling technique was employed for the dominant class to mitigate the problem of overrepresentation. The identification of optimal k-values for k-means clustering was accomplished through the application of the Elbow method, subsequently leading to the formation of clusters, with sample selection predicated on intra-cluster density and spatial proximity to cluster boundaries. A consistent ratio of 4:1 was upheld between negative and positive samples across all experimental groups to guarantee uniformity and comparability among the conditions.\u003c/p\u003e\u003cp\u003eThe apparatus and methodologies employed in this investigation encompassed MATLAB software for the purposes of neural network modeling and data analysis, alongside k-means clustering to achieve data stratification. 
The implementation of the MLP neural network was carried out utilizing MATLAB\u0026rsquo;s Neural Network Toolbox, which enabled training through the Levenberg\u0026ndash;Marquardt algorithm\u0026mdash;a reputable and extensively utilized backpropagation technique distinguished for its expeditious convergence and precision in tasks involving function approximation. Meticulously designed scripts were put together to enhance the execution of a clever sampling process, which focuses on the measurement of distances between samples across chosen clusters. Data files obtained from the Protein Data Bank (PDB) and FASTA formats were systematically parsed and preprocessed employing conventional bioinformatics protocols, with features such as position-specific scoring matrices (PSSMs) being extracted through the utilization of BLAST and HSSP computational tools. No tangible laboratory apparatus was necessary, since all evaluations were conducted through computational methods.\u003c/p\u003e\u003cp\u003eThe process of gathering data included multiple phases, starting with the initial steps of refining and pulling out characteristics from the raw protein sequence documents. Features derived from Position-Specific Scoring Matrices (PSSM) and Homology-derived Secondary Structure of Proteins (HSSP) were aggregated to formulate the input vectors pertinent to each residue pair. Samples associated with the dominant class went through k-means clustering, and within every identified cluster, the sample selection was decided by reviewing density metrics and their distance from the cluster boundaries. Subsequently, the refined datasets were introduced into Multi-Layer Perceptron (MLP) networks characterized by diverse architectural configurations, and the training of the models was conducted employing a conventional 10-fold cross-validation methodology. 
The performance assessment utilized metrics like accuracy, PPV, TPR, and F1-score, ensuring a detailed inspection of the effectiveness of classification. Ethics clearance was viewed as not needed, because the investigation made use of biological datasets that are available to the public and did not involve human or animal experimentation.\u003c/p\u003e\u003cp\u003eThe fusion of the experimental architecture, precisely curated samples, computational methods, and methodical protocols devised a coherent methodological framework aimed at unpacking the research quandary detailed in the article. Uniting diverse clustering methodologies with principles of supervised learning ushered in a creative solution for managing the frequent dilemma of data imbalance in bioinformatics classification. The adoption of consistent ratios in sampling, reliable metrics for performance assessment, and thorough cross-validation strategies enhances both the rigor and reproducibility of the study. All methodological procedures were rigorously designed to assess the effectiveness of cluster-informed data balancing strategies in enhancing the predictive accuracy of protein contact maps, thus establishing a reproducible framework for future research in this domain.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eThe comprehensive results of the experimental investigation revealed that the suggested method for cluster-based data balancing significantly enhanced the efficacy of the multilayer perceptron (MLP) classifier in the prediction of protein contact maps. Across various experimental cohorts, the findings exhibited consistent improvements in sensitivity and F1-score when the samples from the majority class were selected in accordance with intra-cluster density and proximity to the cluster boundaries. 
The methodology demonstrated significantly superior efficacy when juxtaposed with the conventional technique of random undersampling, which generally eliminated 20% of negative instances without taking into account the distribution of the samples. Grouped sampling methodologies that integrate k-means clustering alongside density-based selection produced more robust and precise classification outcomes. Diverse configurations associated with neural network architectures and assessed undersampling conditions illuminated these discernible patterns within the research.\u003c/p\u003e\u003cp\u003eAddressing the initial research inquiry\u0026mdash;whether the proposed cluster-based undersampling technique enhances classification sensitivity in comparison to random sampling\u0026mdash;the results substantiated a distinct performance improvement. The mean sensitivity (true positive rate) for the control method (random 20% undersampling) was documented at 50.02%, whereas the cluster-based LDLD (Low Density\u0026ndash;Low Distance) method attained an average sensitivity of 56.59%, and the MDMD (More Density\u0026ndash;More Distance) method reached a maximum average sensitivity of 61.70%. These enhancements were consistently observed across all seven experimental cohorts evaluated. For instance, in Group 4 of the MDMD configuration, the sensitivity achieved its pinnacle individual value of 63.55%, as delineated in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e of the article. 
These advancements underscore the efficacy of integrating density and boundary information during the sample selection process.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eResults of MLP neural network using:(a) 20% undersampling,(b) LDLD (Low Density\u0026ndash;Low Distance)(c) MDMD (More Density\u0026ndash;More Distance)\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eMethod\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e\u003cp\u003eCriteria\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSensitivity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eSpecificity\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e20% Undersampling\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.5002\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.9626\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.9047\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMDMD\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.6170\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.9104\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.9327\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLDLD\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.5659\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.9042\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.9331\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eIn conjunction with the second research objective, which investigated the influence of the proposed sampling methodology on overall classification precision and F1-score, the findings further indicated quantifiable enhancements. The primary technique generated a typical accuracy of 90.47% together with an F1-score of 0.5002. Alternatively, the LDLD process secured an average accuracy of 93.31% along with an F1-score of 0.5659, compared to the MDMD model that noted an average accuracy of 93.27% with an F1-score of 0.6170. The evaluation metrics illustrated a steady benefit for the bulk of groups, indicating that the cluster-focused selection approach not only heightened sensitivity but also adjusted the overall balance between precision and recall. 
The performance data derived from the MLP classifiers trained on these datasets were encapsulated in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, and visually corroborated by Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, which illustrated the trends of sensitivity and F1-score across various groups.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAdditional significant findings encompass the efficacy of the Elbow method in ascertaining optimal cluster quantities for the k-means algorithm. Analysis of various clustering frameworks demonstrated that categorization based on both density and boundary distance led to noticeably improved results when set against sampling methods that depended exclusively on one criterion or arbitrary choice. Moreover, across all experimental cohorts, the MDMD configuration consistently demonstrated the highest sensitivity and F1-score, implying that the synergistic effect of sample density and boundary proximity was more efficacious than the influence of either factor in isolation. It is noteworthy that the cluster-based methodologies exhibited a more consistent classification performance across differing threshold values and counts of MLP neurons. There were no discernible anomalies or deteriorations in performance evident within the experimental data, and the findings consistently reinforced the legitimacy of the proposed sampling strategy across various network and dataset configurations.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe results of this investigation furnish substantial evidence that a cluster-based undersampling methodology significantly enhances the efficacy of classification models in the realm of imbalanced datasets, particularly in the domain of protein contact map prediction. 
By judiciously selecting majority class samples predicated upon intra-cluster density and their proximity to the boundaries of clusters, the model demonstrated enhanced sensitivity and F1-scores, thereby signifying a more equitable and precise classification of contact and non-contact residue pairs. These observations confirm the theory that incorporating structural properties of the dataset into the sampling process can ease the biases characteristically tied to random undersampling strategies (Batista et al., 2004).[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] The sophisticated sampling framework elucidated within the dissertation holds theoretical importance as it capitalizes on the intrinsic properties of the data to guide machine learning preprocessing, which is consistent with the overarching trend in bioinformatics towards the development of computational models that possess biological relevance (Ma et al., 2015).[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] Augmented sensitivity in identifying residue interactions considerably enhances the accuracy of subsequent structural predictions, thus offering significant advantages for activities related to protein folding and drug development.\u003c/p\u003e\u003cp\u003eThe results of the analysis resonate with and further illuminate the insights obtained from earlier academic efforts regarding data imbalance and its significance for classification effectiveness. Standard approaches, including SMOTE and cost-sensitive learning, have addressed the challenge of class imbalance to different extents; nonetheless, they tend to produce synthetic datasets or rely on outside cost evaluations that might be hard to align correctly. 
The ongoing methodology differs from older techniques by merging core structural elements of the dataset\u0026mdash;particularly, density of clusters and distance measurements\u0026mdash;ensuring that the original data distribution remains intact. This sets it apart from antecedent clustering-based methodologies such as boundary sampling or rough-fuzzy classification, which predominantly concentrated on geometric or probabilistic boundaries without accounting for density considerations (Peng et al., 2014; Mazumder et al., 2015).[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] The methodology presented in this dissertation enhances the contributions of Jones et al. (2015)[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] and Eickholt and Cheng (2015),[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] whose ensemble and deep learning methodologies have derived advantages from the utilization of balanced input data; however, they were deficient in the implementation of systematic strategies to address data imbalance prior to the training phase. Consequently, the present outcomes advance the development of pre-classification data balancing techniques within the realm of protein structure prediction.\u003c/p\u003e\u003cp\u003eIn light of the favorable results, one must still identify several limitations that are built into the empirical structure applied. The dataset was meticulously crafted with an emphasis on the binary classification of interactions among residues, thereby restricting its relevance to scenarios associated with multi-class or multi-label bioinformatics. 
Moreover, although the Elbow method has demonstrated its efficacy in determining the optimal number of clusters, the selection procedure may be prone to the intrinsic variability present in the data, which could impede reproducibility across various protein datasets or structural configurations. The research also refrained from evaluating the methodology utilizing alternative machine learning models beyond the multilayer perceptron (MLP), thereby leaving unresolved inquiries regarding its compatibility with more sophisticated deep learning architectures, including convolutional neural networks (CNNs) or transformers. Subsequent investigations ought to delve into these extensions, assess the resilience of the clustering methodology across a more extensive array of biological datasets, and examine adaptive cluster selection strategies that modify sampling in accordance with the intricacies of the dataset. Such advancements possess the potential to enhance the significance and applicability of the technique in practical implementations of structural biology and systems bioinformatics.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis research endeavor aimed to assess the efficacy of an innovative cluster-centric data balancing technique for enhancing classification outcomes in the prediction of protein contact maps. The empirical results substantiated that the application of majority class undersampling, guided by intra-cluster density and proximity to cluster boundaries, resulted in significant enhancements in sensitivity, accuracy, and F1-score when juxtaposed with traditional random sampling methodologies. By elucidating the central research inquiry, the findings indicated that the suggested LDLD and MDMD methodologies yielded a more equitable and representative training dataset for the MLP classifier, thereby augmenting its efficacy in identifying instances of the minority class. 
The results obtained from this study empirically corroborated the research hypothesis positing that data balancing predicated on the structural characteristics of the dataset could produce enhanced classification results in instances of imbalanced bioinformatics challenges.\u003c/p\u003e\u003cp\u003eThe broad consequences of this inquiry are applicable to both conceptual frameworks and tangible implementations in machine learning and bioinformatics. The proposed methodology offers a comprehensive and adaptable framework for the preprocessing of imbalanced datasets in classification tasks where the identification of minority classes is critically significant, as illustrated by its applications in disease marker prediction, gene function clarification, and protein interaction analysis. Moreover, eschewing the generation of synthetic data while capitalizing on inherent data characteristics, this methodology is congruent with optimal practices in machine learning pertinent to biological relevance. Future investigations may build upon this study by amalgamating the sampling framework with advanced deep learning paradigms, including convolutional and transformer-based architectures, evaluating its resilience in multi-class contexts, and extending its application to additional fields where class imbalance presents a notable obstacle. The results significantly advance the continuous evolution of sophisticated preprocessing methodologies that improve model equity, accuracy, and applicability in empirical scientific contexts.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eContributed to coding and implementation of the programs\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eJumper J et al. Highly accurate protein structure prediction with AlphaFold, \u003cem\u003enature\u003c/em\u003e, vol. 596, no. 7873, pp. 
583\u0026ndash;589, 2021.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShackelford G, Karplus K. Contact prediction using mutual information and neural nets. Proteins Struct Funct Bioinform. 2007;69:159\u0026ndash;64.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHabibi N, Saraee M, Korbekandi H. Protein contact map prediction using committee machine approach. Int J Data Min Bioinform. 2013;7(4):397\u0026ndash;415.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBuda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks, \u003cem\u003eNeural networks\u003c/em\u003e, vol. 106, pp. 249\u0026ndash;259, 2018.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321\u0026ndash;57.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBatista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsl. 2004;6(1):20\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBholowalia P, Kumar A. EBK-means: A clustering technique based on elbow method and k-means in WSN. Int J Comput Appl, 105, 9, 2014.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLema\u0026Atilde;Žtre G, Nogueira F, Aridas CK. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res. 2017;18(17):1\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEickholt J, Cheng J. A study and benchmark of DNcon: a method for protein residue-residue contact prediction using deep networks. BMC Bioinformatics. 2013;14:S12.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChicco D, Jurman G. 
The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSaito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins, \u003cem\u003eBioinformatics\u003c/em\u003e, vol. 31, no. 7, pp. 999\u0026ndash;1006, 2015.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang S, Sun S, Li Z, Zhang R, Xu J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol. 2017;13(1):e1005324.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoodfellow I, Bengio Y, Courville A. Deep learning. Cambridge: MIT Press; 2016.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhao Y, Shrivastava AK. Combating sub-clusters effect in imbalanced classification, in 2013 \u003cem\u003eIEEE 13th International Conference on Data Mining\u003c/em\u003e, 2013: IEEE, pp. 1295\u0026ndash;1300.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang Y, Fu P, Liu W, Chen G. Imbalanced data classification based on scaling kernel-based support vector machine. Neural Comput Appl. 2014;25(3):927\u0026ndash;35.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBarua S, Islam MM, Yao X, Murase K. MWMOTE\u0026ndash;majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng. 2012;26(2):405\u0026ndash;25.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eL\u0026oacute;pez V, Fern\u0026aacute;ndez A, Garc\u0026iacute;a S, Palade V, Herrera F. 
An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci. 2013;250:113\u0026ndash;41.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMa J, Wang S, Wang Z, Xu J. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning, \u003cem\u003eBioinformatics\u003c/em\u003e, vol. 31, no. 21, pp. 3506\u0026ndash;3513, 2015.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePeng L, Xiao-yang Y, Ting-ting B, Jiu-ling H. Imbalanced data SVM classification method based on cluster boundary sampling and DT-KNN pruning. Int J Signal Process Image Process Pattern Recognit. 2014;7(2):61\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMazumder RU, Begum SA, Biswas D. Rough Fuzzy classification for class imbalanced data, in \u003cem\u003eProceedings of Fourth International Conference on Soft Computing for Problem Solving: SocProS 2014, Volume 1\u003c/em\u003e, 2014: Springer, pp. 
159\u0026ndash;171.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Protein contact map prediction, imbalanced data classification, cluster-based undersampling, k-means clustering, multilayer perceptron, data balancing techniques, bioinformatics classification","lastPublishedDoi":"10.21203/rs.3.rs-7528788/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7528788/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"","manuscriptTitle":"Cluster-Based Data Balancing for Protein Contact Map Prediction Using MLP Neural Networks","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-08 18:10:41","doi":"10.21203/rs.3.rs-7528788/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"237b0c31-ce14-4a91-a729-8a1f91ac0062","owner":[],"postedDate":"September 8th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-10-12T11:38:11+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-08 18:10:41","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7528788","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7528788","identity":"rs-7528788","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
