The Fraction-product: A Novel Discriminant Statistic for Binary Classification

2025 · doi:10.21203/rs.3.rs-8030267/v1


Frank A. Greco, Eugene B. Hanlon

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8030267/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

Background
This paper characterizes the fraction-product as a novel discriminant statistic, which we have found to be extremely useful in feature selection on spectroscopic data. In supervised binary classification, the fraction-product measures the amount of taxonomic information in each attribute. The simplicity of the idea facilitates its adaptation to different data sets and, in some settings, leads to new, useful measures. After a discussion of its mathematical foundation, it is applied as a worked example to the Diagnostic Wisconsin Breast Cancer Database.

Results
The analysis of non-spectroscopic data suggests the utility of another new measure which is called taxonomic potential. Given two attributes, the taxonomic potential measures the potential for one feature to have taxonomic information that is not explained by its correlation with the other feature. The fraction-product and taxonomic potential allow the rapid selection of four features which, after weighting with linear discriminant analysis, lead to accuracy = 97.9%; recall = 1.0; precision = 92.2%. Moreover, the three major features are stable with respect to variations of the training set.

Conclusions
The fraction-product is a new discriminant statistic that has been useful in supervised, binary classification in two very different data sets: spectra and geometric measures of cell nuclei. It is simple and can be easily adapted to unique features of the data for the best outcomes.

Subject areas: Biological sciences/Cancer; Biological sciences/Computational biology and bioinformatics; Physical sciences/Mathematics and computing
Keywords: binary classification; discriminant; feature selection; nonparametric statistic; Diagnostic Wisconsin Breast Cancer Database

Introduction

This work grew out of attempts to use near-infrared reflectance spectroscopy in vivo to classify two groups of subjects: Alzheimer’s disease and elderly controls[1–6]. One exploratory approach in spectroscopy displays the spectra as intensity against wavelength using colors to indicate each class and allows one to look for regions in which the colors separate. We sought to automate this process for intensity at each detector pixel (wavelength) as follows:

1. For class 1, determine the median value (M1) of the data.
2. For class 2, determine the median value (M2) of the data.
3. Estimate a classification cutoff as (M1 + M2)/2.
4. Determine the fraction (f1) of class 1 data points on the same side of the cutoff as M1.
5. Determine the fraction (f2) of class 2 data points on the same side of the cutoff as M2.

It appeared that the arithmetic product of those fractions, f1×f2, could serve as a measure of how well detected light at that wavelength discriminated between the two classes. We called this discriminant statistic the “fraction-product” (abbreviated as “fp”) and calculated it as the 6th step in the above algorithm. We refer to the intuition associated with this discriminant measure as “taxonomic information.” The fraction-product facilitated the selection of those wavelengths with the greatest efficacy in classifying the subjects in our study[2, 5], and we expect that supervised feature selection may be its most useful application. After selecting the features, a classification algorithm combines them, often weighting them in the process. Therefore, the rough calculation of the cutoff in step 3 for each feature should not degrade the final accuracy of their combination.

Here we explore the general nature of the fraction-product and illustrate how it may usefully analyze a publicly available, non-spectroscopy dataset: the Diagnostic Wisconsin Breast Cancer Database[7]. The purpose of this paper is to communicate the fraction-product as a simple idea and to illustrate how it may be applied practically. After a presentation of the mathematical basis of the concepts, we proceed directly to the application.

Typically, the primary concern of feature selection is the ability of the classification algorithm to optimize some measure of performance, that is, to find the best solution to the classification problem. Common secondary aims are to keep the features as uncorrelated as possible and their number as small as possible. These additional goals lead to a new measure that we will call “taxonomic potential.” Finally, we will discuss some applications that are useful in spectroscopy but are not likely to generalize to other fields.
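
The six-step procedure described in the Introduction can be sketched in a few lines. The authors worked in R; the Python below is only an illustrative stand-in. The function name `fraction_product` and the rule that a point lying exactly on the cutoff counts as correctly allotted are assumptions of this sketch, not details taken from the paper.

```python
from statistics import median


def fraction_product(class1, class2):
    """Return (fp, f1, f2, cutoff) for one numeric attribute.

    Tie handling (a point exactly on the cutoff, or equal medians) is an
    assumption of this sketch: such points count as correctly allotted.
    """
    m1, m2 = median(class1), median(class2)      # steps 1 and 2
    cutoff = (m1 + m2) / 2                       # step 3
    if m1 >= m2:
        # class 1 sits on the high side of the cutoff, class 2 on the low side
        f1 = sum(x >= cutoff for x in class1) / len(class1)   # step 4
        f2 = sum(x <= cutoff for x in class2) / len(class2)   # step 5
    else:
        f1 = sum(x <= cutoff for x in class1) / len(class1)   # step 4
        f2 = sum(x >= cutoff for x in class2) / len(class2)   # step 5
    return f1 * f2, f1, f2, cutoff               # step 6: fp = f1 * f2


# Toy values for illustration only.
class1 = [5.1, 6.0, 6.4, 7.2, 4.0]
class2 = [3.0, 3.5, 4.2, 5.5, 2.8]
fp, f1, f2, cutoff = fraction_product(class1, class2)
print(round(fp, 2), f1, f2, cutoff)   # 0.64 0.8 0.8 4.75
```

In the toy example one point of each class falls on the wrong side of the cutoff, so each fraction is 4/5 and the fraction-product is 0.64.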

Materials and methods

Theory

As the fields of machine learning and modern data analysis add their unique literatures to those of the more venerable statistics and theory of errors of observations, the terminology common to all four fields may take on slightly different connotations. Here we use the term “statistic” in its more recent sense of “a numerical characteristic of a sample”[8] without Fisher’s earlier implication that its purpose is to estimate a population parameter[9]. The words “feature” and “attribute” have been used interchangeably. We will follow a common practice of giving “feature” the connotation of “a special attribute.” Since the values of all attributes except “class/diagnosis” are numeric, they are “variates.” We shall use the term “information” in its everyday, nontechnical sense. As a shorthand, we shall refer to a datum falling on the same side of the cutoff as its group median as being “correctly allotted” or “correctly classified” by that attribute.

Classification studies begin with the random selection of individuals from defined populations, which introduces sampling error, and the examination of various attributes that characterize those individuals. In our case, the attributes are numeric and can order or rank the data points. Figure 1 illustrates this process for two attributes (X1 and X2) as well as the application of the above algorithm to the sample. For a given data sample and attribute, f1 is the probability that a data point randomly selected from group 1 in the sample will be correctly allotted, and similarly for f2. It follows that if two data points are drawn from the sample, one randomly from group 1 and the other randomly from group 2, then the fraction-product is the probability that both will be correctly allotted. Thus, the fraction-product measures the amount of “taxonomic information” in that attribute, which is the desideratum.
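
Written out, the probability reading above amounts to the following. The event symbols \(C_1\) and \(C_2\) are notation introduced here for illustration only; they do not appear in the original analysis.

```latex
% C_k: a point drawn at random from group k falls on the same side of the
% cutoff as its group median M_k (notation assumed for this sketch).
\[
  f_1 = P(C_1), \qquad f_2 = P(C_2), \qquad
  fp = f_1 \times f_2 = P(C_1 \cap C_2),
\]
% The last equality holds because the two draws are made independently,
% one from each group. Example: f_1 = 0.9 and f_2 = 0.8 give fp = 0.72.
```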

In order to compare the relative taxonomic efficacy of two attributes in the same sample, the fp-value of each suffices. When the classification algorithm weights and combines features, it is useful to include attributes with the maximum values of f1 and f2. Intuitively, f1 and f2 also contain taxonomic information, although of a slightly different nature from that in the fraction-product; f1 and f2 assess the effect of the cutoff on each subset separately, whereas f1×f2 assesses the effect on the whole sample.

The association of the fraction-product with a probability based on drawing one subject from each group in the sample clarifies its relationship to other similar measures. For example, the area under the curve of the receiver operating characteristic is also a discriminant statistic that measures a probability concerning two subjects drawn randomly, one from each of the two groups; however, it corresponds to the probability that the two points will be in the same order as their group medians[10, 11]. Being “correctly classified” is a conceptually more appealing taxonomic measure than being “correctly ordered.” Likewise, accuracy, the percentage of subjects correctly classified, is similar to the fraction-product. However, accuracy is the probability that one subject drawn randomly from the entire sample will be correctly allotted, which corresponds to a probabilistic process different from the fraction-product and which is affected by prevalence. A simple example described in § 1 of the Supplementary Materials further clarifies this point and shows the influence of prevalence.

As a statistic, the fraction-product is non-parametric according to the common use of that term. Furthermore, its utility does not depend on the distribution of the population from which the sample is drawn.

General properties of the fraction-product

The range of the fraction-product is between 0.25 (no taxonomic information) and 1.0 (complete separation of the two groups).
It is always positive, which serves well to quantify the amount of taxonomic information in the attribute. To expand on this point, fix the difference between M1 and M2 as M1 − M2. If M1 < M2, the sign of the difference is negative, but f1 and f2 remain positive numbers. Moreover, each median divides its own class in half and the cutoff lies between the two medians, so at least half of each class falls on the same side of the cutoff as its median. Therefore, the minimum possible value of f1 and f2 is 0.5, and the minimum value of their product is 0.25. As the product of two ratios of natural numbers, the fraction-product has no units and may be viewed as a pure number. However, any fraction-product is tightly linked to the attribute and the two classes that generated it.

In medical applications, the number of subjects may be small, and the fraction-product can take on only a few values. As an extreme example, if each group has 5 subjects, then f1 and f2 can assume only the values 0.6, 0.8 and 1.0. Therefore, the fraction-product can take on one of six values: 0.36, 0.48, 0.6, 0.64, 0.8, 1.0; the step between 0.6 and 0.64 is much smaller than the other steps, which can lead to unexpected clustering of data points. As this example makes clear, discreteness will be much more apparent if f1 and f2 are used as statistics than it will be for the fraction-product. Furthermore, the minimum value of the fraction-product in this extreme example is 0.36, not 0.25. The approach described here can be tailored to apply to small N, say 10 or 15 subjects in each group, but in what follows we will assume that N is large enough that discreteness, although always present, has no significant influence.

Distribution of the null hypothesis for an attribute

It is necessary to get a sense of how likely a particular fp-value is to occur completely by chance, i.e., by random sampling. The question here is whether a given attribute reveals the two populations. The fraction-product allows the null hypothesis for a particular attribute (that the distributions of Class 1 and Class 2 are identical) to be reworded as fp = 0.25 for that attribute.
Given the number of subjects in each group of a sample, we are not aware of any method to calculate the distribution of fp-values, even if the population distributions are known or assumed. Therefore, we have approached this issue through simulations, usually assuming normal distributions for both classes in the simulation. Typical results are shown in Fig. 2. Examination of the scales of the two x-axes leads to the most apparent conclusion: the smaller the number of subjects in each class, the broader the spread of the fp-values from the true value of 0.25. As expected, the distributions are not Gaussian.

Although p-values are usually not used in feature selection, they are necessary for evaluating extreme fp-values because the hard lower limit of 0.25 precludes the use of a measure like confidence limits. For normally distributed variates, an extreme value may be defined as being outside the 3SD limits of the distribution. This is equivalent to the requirement that the datum must have a probability (p-value) < 0.003 of coming from the distribution. We usually use p < 0.001 as the criterion for an extreme value. Simulations demonstrate that with 100 subjects in each class, an attribute with an fp-value ≥ 0.38 would be extreme by this criterion. With 15 subjects in each class, an attribute with an fp-value ≥ 0.64 would be extreme. These values comport with the distributions shown in Fig. 2. Similarly, as shown in the caption of Fig. 2, the p-value associated with fp ≥ 0.4 in panel A is 0.00018. In panel B, the p-value associated with fp ≥ 0.4 is 0.139. Therefore, if there are 100 individuals in each class in the sample, an attribute with an fp-value of 0.4 may not have significant taxonomic information, but it most likely did not occur by chance.
If there are only 15 individuals in each class in the sample, then an attribute with an fp-value of 0.4 would not be attractive; it not only has little taxonomic information but also has a high probability of being due to a random fluctuation from sampling. In practice, a good discriminant feature will have an fp-value that is extreme with respect to the distribution from the null hypothesis for an attribute. This is crucial for studies with relatively small numbers of subjects. This is also the reason we have used the term “extreme values” rather than “outlier,” which connotes something to be excluded rather than sought. The worked application below will further clarify these trade-offs (Fig. 4).

Analysis of the Diagnostic Wisconsin Breast Cancer Database

The Diagnostic Wisconsin Breast Cancer Database[7] consists of thirty numerical attributes measured on the cell nuclei of breast tissue obtained by fine needle aspiration. The irregularity of the nuclei is an important diagnostic factor, and this was assessed by fitting a spline curve to the apparent boundary of each cell’s nucleus. Various geometric properties of the curve, such as area, perimeter and fractal dimension, were determined on the population of nuclei on a single slide. Given that malignant cells may be only a small portion of those studied on each slide, the “worst” values and standard errors (designated as “_se” in the attribute name) were used as attributes in addition to the mean. Therefore, ten geometric properties became thirty attributes assigned to each slide for analysis. Some attributes, like area and perimeter, are expected to be correlated. The classification was binary: benign or malignant. The dataset contained 212 malignant specimens (Class 1) and 357 benign specimens (Class 2).

General analytic approach to feature selection

All statistical computations were done using R[12]. Supplementary Material § 2 is the .Rhistory file of this analysis, annotated for those not familiar with R.
Here we describe the procedure in general terminology. Linear discriminant analysis[13, 14] will be used as the diagnostic algorithm after feature selection.

After downloading the data set from the website, preliminary steps are to load the .csv file into R and to remove the patient identifier number. The first column is now the diagnosis: malignant (M) or benign (B). Next, random samples of 141 slides classified as benign and 141 slides classified as malignant are drawn. For reproducibility, a seed is set before taking each random sample. When combined, the two subsets form the training set of 282 slides, which is about 50% of the whole data set of 569 slides. Although this is less than that commonly used for training, the reason for this choice will become clear when the variability of features selected is studied later (see below, Stability of feature selection).

1. Perform the simulation to determine the distribution of the null hypothesis for the training set with 141 subjects in each group. As above, the simulation assumed two normally distributed variates with mean = 0 and sd = 1. Because we will use linear discriminant analysis to weight the features selected for the diagnostic algorithm, the distributions of f1 (which is identical to that of f2 under the null hypothesis) and of the fraction-product are displayed in Fig. 3. The criterion for extreme values of the fraction-product at the p < 0.001 level is fp > 0.359, and that for f1 and f2 is f1, f2 > 0.605.

2. For the two groups in the training set, determine values of fp, f1, f2, and M1 − M2 for all 30 attributes. The malignant subset is Class 1, and the benign subset is Class 2. We call the array that stores these values the “fraction-product data frame.” Because linear discriminant analysis assigns weights to the features, fp, f1 and f2 will be used as statistics, and their distributions over all 30 attributes are shown in Fig. 4.

3. Remove attributes with no information.
Five attributes had fp < 0.359 and were removed: fractal_dimension_mean, fractal_dimension_se, texture_se, smoothness_se and symmetry_se. All subsequent operations were performed on this second fraction-product data frame with 25 attributes. The fp-value alone, rather than f1 or f2, determines whether an attribute is removed. The reason is that an attribute may, for example, have f1 = 0.5 and f2 = 0.9, which entail fp = 0.45; it would be premature to remove it because of f1 < 0.605 alone when both fp and f2 suggest retaining it.

4. Determine the attributes with maximum values of fp, f1 and f2 and select as features. In this analysis, only one attribute attained the maximum value for each statistic; the first three features are radius_worst, concave.points_worst and area_worst respectively. To keep the presentation general, we will refer to these features as maxfp.feat, maxf1.feat and maxf2.feat instead of their particular names.

5. Determine the correlation coefficients of the features maxfp.feat, maxf1.feat and maxf2.feat with all attributes. Histograms for the three features are shown in Fig. 5. The histogram for feature maxf1.feat is remarkable, showing a clear separation of attributes into two groups: highly correlated with maxf1.feat and uncorrelated with it. We will return to this fact later to devise a method that takes advantage of it. Here we proceed with a more systematic and general approach.

6. Determine the taxonomic potential for each entry in the 3×25 array of correlation coefficients. For any two attributes A and B, \(1 - r_{AB}^{2}\) is a rough measure of the variation of A and B that cannot be explained by their linear relationship. Denoting the fraction-product of attribute A as \(fp_A\), the expression \(fp_A \times (1 - r_{AB}^{2})\) is a measure of the potential for the taxonomic information in A (\(fp_A\)) to not be explained by that in B.
Similarly, the expression \(fp_B \times (1 - r_{AB}^{2})\) is a measure of the potential for the taxonomic information in B to not be explained by that in A. Two points deserve emphasis. First, if there is little taxonomic information in A, there will be little potential regardless of how much of the variation in A cannot be explained by its correlation with B. Second, if most of the variation in A can be explained by B, there will be little potential regardless of how much taxonomic information there is. Clearly, if A = B, the potential is 0. Since we are using fp, f1 and f2, there are three measures of taxonomic information. Therefore, for any two attributes, we define the taxonomic potential of attribute A with respect to attribute B to be

\(\text{pot.B}(A) = \text{tax}(A)\,(1 - r_{AB}^{2})\)

where tax(A) is a measure of the taxonomic information in A and could be fp, f1 or f2. Letting X be a variable whose domain is the set of 25 attributes, we compute

\(\text{pot.maxfp.feat}(X) = fp_X\,(1 - r(X, \text{maxfp.feat})^{2})\)
\(\text{pot.maxf1.feat}(X) = f1_X\,(1 - r(X, \text{maxf1.feat})^{2})\)
\(\text{pot.maxf2.feat}(X) = f2_X\,(1 - r(X, \text{maxf2.feat})^{2})\)

We use the function input notation for Pearson’s r(attribute 1, attribute 2) instead of the commonly used subscript notation for single letter variables (\(r_{xy}\)) because the names here are simply too long. The features with the maximum values of the three taxonomic potentials are respectively smoothness_worst, texture_mean, and smoothness_worst, which adds two additional features to the initial three.

7. Perform linear discriminant analysis (LDA) on the training and test sets using five features: radius_worst, concave.points_worst, area_worst, smoothness_worst and texture_mean. The weights (scalings) are determined in the training set. The summary results are obtained by treating the data of each set as unknowns and comparing the predicted class with the diagnosis.
The Diagnostic Wisconsin Breast Cancer Database website[7] lists accuracy and precision as performance measures for comparison of various methods. In the diagnosis of cancer, much importance is placed on not missing a malignancy; for this reason, we also compute recall as a measure of performance. The results are entered as LDA 1 in Table 1. We note that once the features are selected, they are weighted by LDA according to their ability to separate the two classes, regardless of the reason why each one was selected. These weights are given in Table 2. Furthermore, the customary word “training” is not accurate for linear discriminant analysis because it is determinative on whatever subset it is applied to; it is not trained. Therefore, it is useful to compare the performance of the training and test sets, especially because the prevalence changes from 50% to 24.7%.

8. Because of the natural division of attributes by maxf1.feat into correlated and uncorrelated groups (Fig. 5), we consider an alternative to the taxonomic potential that was calculated systematically above. Any attribute with Pearson’s r < 0.65 may be considered as uncorrelated with maxf1.feat. Then we can simply ask: among the attributes uncorrelated with maxf1.feat, which has the maximum value of f1? Texture_worst is the unique answer. We then perform LDA using four features: radius_worst, concave.points_worst, area_worst, and texture_worst. The summary results are also given in Table 1 (LDA 2) and include the encouraging recall-value of 1 for the test set. The weights are listed in Table 2.
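
Two steps of the procedure above lend themselves to a short sketch: the null simulation of step 1 and the taxonomic potential of step 6. The code is Python rather than the authors' R, and every function name, the tie rule at the cutoff, the number of simulations and the use of an empirical quantile as the p < 0.001 criterion are assumptions made for illustration. The toy attribute values are not taken from the Wisconsin data.

```python
import random
from statistics import median


def fraction_stats(class1, class2):
    """Return (fp, f1, f2) for one attribute.

    Points exactly on the cutoff count as correctly allotted; this tie
    rule is an assumption of the sketch.
    """
    m1, m2 = median(class1), median(class2)
    cutoff = (m1 + m2) / 2
    if m1 >= m2:
        f1 = sum(x >= cutoff for x in class1) / len(class1)
        f2 = sum(x <= cutoff for x in class2) / len(class2)
    else:
        f1 = sum(x <= cutoff for x in class1) / len(class1)
        f2 = sum(x >= cutoff for x in class2) / len(class2)
    return f1 * f2, f1, f2


def null_fp_threshold(n_per_class, p=0.001, n_sim=2000, seed=1):
    """Step 1: empirical (1 - p) quantile of fp when both classes are N(0, 1)."""
    rng = random.Random(seed)
    fps = []
    for _ in range(n_sim):
        g1 = [rng.gauss(0, 1) for _ in range(n_per_class)]
        g2 = [rng.gauss(0, 1) for _ in range(n_per_class)]
        fps.append(fraction_stats(g1, g2)[0])
    fps.sort()
    return fps[min(n_sim - 1, int((1 - p) * n_sim))]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def taxonomic_potential(tax_a, values_a, values_b):
    """Step 6: pot.B(A) = tax(A) * (1 - r_AB^2); tax is fp, f1 or f2."""
    return tax_a * (1 - pearson_r(values_a, values_b) ** 2)


# Toy values, not taken from the Wisconsin data.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, -1.0, -1.0, 1.0]   # constructed so that r(a, b) = 0
print(taxonomic_potential(0.8, a, a))   # 0.0: A is fully explained by itself
print(taxonomic_potential(0.8, a, b))   # 0.8: none of A is explained by B
print(null_fp_threshold(141, n_sim=1000))
```

With 141 subjects per class the simulated threshold should land near the fp > 0.359 criterion quoted in step 1, although the exact value depends on the seed and on the number of simulations.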

Table 1. Summary results of linear discriminant analysis

Model           Number of features  Accuracy  Accuracy CI  Recall  Precision
LDA 1 training  5                   0.957     0.927–0.978  0.95    0.96
LDA 1 test      5                   0.976     0.950–0.990  0.97    0.93
LDA 2 training  4                   0.958     0.927–0.978  0.94    0.97
LDA 2 test      4                   0.979     0.955–0.992  1.00    0.92

Table 2. Weights assigned to features

Feature               Why selected                       LDA 1     LDA 2
concave.points_worst  maxf1.feat                         8.4024    12.3059
area_worst            maxf2.feat                         -0.0037   -0.0033
radius_worst          maxfp.feat                         0.6268    0.5545
smoothness_worst      uncorrelated with max(f2&fp).feat  13.8635   not used
texture_mean          uncorrelated with maxf1.feat       0.1083    not used
texture_worst         uncorrelated with maxf1.feat       not used  0.0761

The stability of the features selected

The annotated .Rhistory file for this analysis is included in the Supplementary Material § 3. There are 212 malignant specimens and 357 benign specimens. This allows us to divide the malignant specimens into three subsets of 71, 71 and 70 slides and the benign specimens into five subsets of 71, 71, 71, 72, and 72 slides. By choosing two subsets from the malignant set and two subsets from the benign set, we can generate 30 different training sets of roughly 141 benign and 141 malignant specimens. In this way, we can vary the composition of the benign and malignant training subsets while keeping their numbers nearly identical to those of our first analysis. The anticipation of this analysis determined the choice of 141 elements in the training subsets used in feature selection above.

We limit the study of stability to the features named maxfp.feat, maxf1.feat, and maxf2.feat. The results are summarized in Table 3. For the fraction-product, only one feature attained the maximum value for each analysis; however, over the 30 combinations there were four features selected as maxfp.feat with the distribution indicated in the table. All 30 combinations returned concave.points_worst as maxf1.feat. For maxf2.feat, all 30 combinations included area_worst as the feature.
It was the unique feature for 23 out of the 30 combinations; for 6 there was one other feature that attained the maximum value of f2, and for one there were three features whose f2-values were the maximum. Therefore, the probability that concave.points_worst will be chosen as maxf1.feat approaches 1. The probability that area_worst will be chosen as maxf2.feat is minimally 23/30 = 0.77 and could approach 1 depending upon what rules are implemented when two or three attributes tie for the maximum f2-value. These decisions have nothing to do with the fraction-product per se and are the reason we limit the study of stability to these three features. The Jaccard indices determined for the 30 combinations are as follows: maxfp.feat, 0.33; maxf1.feat, 1.0; maxf2.feat, 0.79.

Table 3. Number of times feature selected in 30 combinations

Feature               maxfp.feat  maxf1.feat  maxf2.feat  Total
Concave.points_worst  13          30          0           43
Area_worst            0           0           30          30
Perimeter_worst       11          0           1           12
Concave.points_mean   5           0           0           5
Radius_worst          1           0           5           6
Area_se               0           0           2           2
Total                 30          30          38
Jaccard index         0.33        1.0         0.79

Table 4 contains the correlation matrix for these six features and indicates that they are all correlated. Those features with correlation coefficients above 0.9 may be viewed as roughly equivalent. Within this perspective, the features fall into two groups. One includes concave.points_worst and concave.points_mean. The other includes area_worst, perimeter_worst, and radius_worst. These two groups correspond to maxf1.feat and maxf2.feat. From this broader viewpoint, the features are extremely stable.
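
The design of the stability study and two of the reported Jaccard values can be checked from numbers already given in the text and in Table 3. The sketch below is Python rather than the authors' R, and it assumes that the stability index is the mean pairwise Jaccard index over the 30 selected feature sets. The maxf2.feat value is not reproduced, because the table does not say which features tied in which training sets.

```python
from itertools import combinations

# Subset sizes given in the text: 212 malignant and 357 benign slides.
malignant_subsets = [71, 71, 70]
benign_subsets = [71, 71, 71, 72, 72]

# Each training set pairs two malignant subsets with two benign subsets.
training_sets = [
    (m, b)
    for m in combinations(range(len(malignant_subsets)), 2)
    for b in combinations(range(len(benign_subsets)), 2)
]


def mean_pairwise_jaccard(selections):
    """Average |A & B| / |A | B| over all pairs of selected-feature sets."""
    pairs = list(combinations(selections, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Selections per training set, rebuilt from the counts in Table 3.
maxfp_sets = (
    [{"concave.points_worst"}] * 13
    + [{"perimeter_worst"}] * 11
    + [{"concave.points_mean"}] * 5
    + [{"radius_worst"}] * 1
)
maxf1_sets = [{"concave.points_worst"}] * 30

print(len(training_sets))                            # 30
print(round(mean_pairwise_jaccard(maxfp_sets), 2))   # 0.33
print(mean_pairwise_jaccard(maxf1_sets))             # 1.0
```

Under that assumption the counts reproduce the 30 training sets (3 malignant pairs times 10 benign pairs) and the Jaccard values of 0.33 and 1.0 quoted for maxfp.feat and maxf1.feat.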

Table 4. Correlation matrix of all features selected from 30 combinations

                      ccav.pt_wst  ar_wst  pr_wst  ccav.pt_mn  rd_wst  ar_se
concave.points_worst  1.0000       0.7474  0.8163  0.9101      0.7874  0.5381
area_worst            0.7474       1.0000  0.9775  0.8096      0.9840  0.8114
perimeter_worst       0.8163       0.9775  1.0000  0.8559      0.9937  0.7612
concave.points_mean   0.9101       0.8096  0.8559  1.0000      0.8303  0.6902
radius_worst          0.7874       0.9840  0.9937  0.8303      1.0000  0.7573
area_se               0.5381       0.8114  0.7612  0.6902      0.7573  1.0000

Note: Row and column names are identical; for display the column names are shortened.
Key: correlates with maxf1.feat; correlates with maxf2.feat.

Applications specific to spectroscopy

Applications of spectroscopy to classification problems historically involve chemometric analysis in which measured spectra are compared to those from known materials in order to characterize the unknown materials chemically[15]. More recently, computer vision analyses have facilitated the use of machine learning techniques[16, 17]. Our approach examines spectral features without regard to standard materials; it seeks simply to distinguish the two classes by comparing the shapes of the spectra[2]. We found that area-normalized first derivatives of the intensity spectra performed best[2, 5].

Most modern spectrometers disperse the incoming light onto a digital detector, which associates the intensity measured at each pixel or bin with a wavelength. In our case, the detector had 1024 pixels, and we viewed each spectrum as a collection of 1024 attributes to be used for classification. Feature selection started with calculating the fraction-product at each pixel on a training set consisting of diseased and non-diseased subjects. However, optical features have linewidth (extent over wavelength), which added another criterion for feature selection: the fp-values must indicate significant discrimination on at least three contiguous pixels[2, 5].
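
The contiguity criterion can be sketched as a scan for runs of pixels whose fp-values exceed a threshold. The function name, the reading of “significant discrimination” as fp at or above a fixed threshold, and the toy fp-values are all assumptions of this Python sketch; the threshold of 0.38 is borrowed from the 100-subjects-per-class simulation discussed earlier purely for illustration.

```python
def contiguous_regions(fp_values, threshold, min_width=3):
    """Return (start, end) index pairs, end exclusive, of runs in which
    fp >= threshold on at least `min_width` contiguous pixels."""
    regions, start = [], None
    for i, fp in enumerate(fp_values):
        if fp >= threshold:
            if start is None:
                start = i            # a new run begins at this pixel
        else:
            if start is not None and i - start >= min_width:
                regions.append((start, i))
            start = None             # the run, if any, ends here
    if start is not None and len(fp_values) - start >= min_width:
        regions.append((start, len(fp_values)))
    return regions


# Toy fp-values for 12 pixels, for illustration only.
fp_by_pixel = [0.30, 0.41, 0.45, 0.43, 0.28, 0.50,
               0.52, 0.27, 0.40, 0.44, 0.47, 0.49]
print(contiguous_regions(fp_by_pixel, threshold=0.38))   # [(1, 4), (8, 12)]
```

The two-pixel run at indices 5 and 6 is rejected even though its fp-values are the highest in the toy data, which is the point of requiring a minimum linewidth.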
Consideration of how to measure taxonomic information over several contiguous pixels led to the notion of taxonomic signal, that is, fp/0.25[4, 5]. Intuitively, this number measures the amount of taxonomic information reported by the attribute using no information as a reference state. It ranges between 1 and 4. When computed on several contiguous pixels, the product of these values will vary according to the value at each pixel but will always increase with the number of pixels, thereby serving as a measure of how well the entire region discriminates between the two classes[4, 5]. In contrast, the product of unscaled fp-values will always decrease with increasing number of pixels, which makes it a counterintuitive measure at best. We discuss these concepts in more detail elsewhere[4, 5]. Although they are extremely effective for optical spectra, they are unlikely to be useful in most other analyses.

Discussion

Diagnostic Wisconsin Breast Cancer Database

The webpage of the Diagnostic Wisconsin Breast Cancer Database lists five techniques that have been applied to the data: xgboost, support vector, random forest, neural network, and logistic regression. Reported accuracies range from 92.3% to 97.9%, and our test set result of 97.7% overall is certainly comparable to the best performance. Precisions vary between 91.6% and 97.9%, with ours being about 93%. Three points deserve attention. First, our training set was about 50% of the whole dataset (282 of 569 specimens) and our test set was about 50% (287 specimens), a smaller training share than in the design of most studies. Although this was chosen to allow the study of feature stability, in our experience the efficacy remains high even for smaller training sets, which is the best that can be said at this stage of development. We attribute this in part to matching the number of the malignant and benign slides in the training set, but this is likely to vary with the data set.
Second, the prevalence of malignancy in the whole data set was 37%; our test set had a lower prevalence (24.7%), yet our method performed as well on the test set as on the training set with a prevalence of 50%. This fact suggests that our method’s performance is relatively independent of prevalence and that the features approach our goal of being pathognomonic. Third, our purpose is simply to illustrate the approach, not to contribute to the diagnosis of breast cancer.

The motivation behind keeping the number of features to a minimum is partly esthetic but primarily practical. The smaller the number of features, the easier it is to note relationships with other variables, for example, between a feature and clinical course. Furthermore, any method put into clinical practice must have quality assurance procedures. A small number of features not only facilitates the development of quality assurance procedures but also simplifies troubleshooting when measurements are out of control. Four or five variables may be studied systematically to isolate problems with research methods, data analysis or manufacturing techniques; thirty variables cannot.

Stability of feature selection

The practical issues arising from devising quality assurance procedures motivated our study of feature stability. Kalousis and colleagues were among the first to address the issue of the stability of feature selection algorithms[18], and their seminal work remains an insightful approach to the problem. Nogueira et al.[19] have reviewed and consolidated the numerous measures that have been suggested, and Bommert and Lang[20] have created an open source software package that implements the commonly used methods, which we used to compute the Jaccard index for the first three features selected. By the Jaccard index, maxf1.feat (1.0) and maxf2.feat (0.79) are considered stable, whereas maxfp.feat (0.33) is not stable.
For our purposes, we note that there is a significant difference between claiming that four features are sufficient to classify nuclei and choosing four features for quality assurance. The former requires a rigorous separation of training and test sets; there is no reason why the latter could not be done on the entire data set after the fact. The choice of concave.points_worst for maxf1.feat and area_worst for maxf2.feat could be made by examination without any computation because they attain the maximal values for those two features in all of the variations of the training sets. In both applications of LDA, maxf1.feat (concave.points_worst) was heavily weighted and played a significant role in classifying slides. This is not surprising because malignant nuclei are highly irregular; a large number of points with concave curvature clearly distinguishes them from the nuclei of normal cells. Because there were fewer malignant cases, our method of combining the subsets of malignant slides (3 subsets) produced less variation in the malignant input than in the benign input (5 subsets). Therefore, the apparent stability of maxf1.feat might be overly optimistic because of the lesser variation of input.

Conclusion

The fraction-product is a novel discriminant statistic based on a simple concept, which allows it to be adapted easily to different datasets. More importantly, its simplicity facilitates the development of additional concepts that enhance its utility: optical taxonomic signal for spectroscopic data and taxonomic potential for correlated attributes.

Data availability

The data are available from Breast Cancer Wisconsin (Diagnostic).

Declarations

Acknowledgements

The authors thank Dr. J. C. Huetter for general discussions concerning the fraction-product and Drs.
Linsey McColl (MD, Select Statistical Services, Ltd), Dane Netherton and Brent Schell for critically reading the manuscript. Although this work was supported in part by the US Department of Veterans Affairs, all opinions are the authors'.

Funding

This work was supported in part by an unrestricted grant from Headwall Photonics, Inc (Frank A. Greco), by Merit Review Award I01 CX000827 from the U.S. Department of Veterans Affairs, Clinical Sciences Research and Development Service (Eugene B. Hanlon) and generally by the U.S. Department of Veterans Affairs.

Author information

Authors and Affiliations

Research Service (151B), Edith Nourse Rogers Memorial Veterans Hospital, Bedford, MA 01730 USA: Frank A. Greco and Eugene B. Hanlon

Contributions

Both authors participated in the design of this work and in the writing of the manuscript.

Corresponding author

Correspondence to Frank A. Greco: [email protected].

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

References

1. Hanlon, E. B. et al. Scattering differentiates Alzheimer disease in vitro. Opt. Lett. 33 (6), 624–626 (2008).
2. Greco, F. A., McKee, A. C., Kowall, N. W. & Hanlon, E. B. Near-Infrared Optical Spectroscopy In Vivo Distinguishes Subjects with Alzheimer's Disease from Age-Matched Controls. J. Alzheimers Dis. 82 (2), 791–802 (2021).
3. Greco, F. A. & Hanlon, E. B. Evaluation of brain tissue and material based on a fraction-product and optical spectroscopy. US Patent and Trademark Office, US-20230175955-A1, 1–46 (2023).
4. Greco, F. A. Optical taxonomic signal. arXiv [physics.med-ph] (2024).
5. Greco, F. A., Schell, B. R. & Hanlon, E. B. Optical Taxonomic Signal and the Diagnosis of Alzheimer's Disease. IEEE Open J. Eng. Med. Biol. 6, 107–112 (2025).
6. Hanlon, E. B. & Greco, F. A. Spectroscopic detection of brain damage. US Patent 10,405,751 B2. United States: The United States of America as represented by the Department of Veterans Affairs (2019).
7. Wolberg, W., Mangasarian, O., Street, N. & Street, W. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository (1993).
8. Everitt, B. S. & Skrondal, A. The Cambridge Dictionary of Statistics, Fourth edn (Cambridge University Press, 2010).
9. Fisher, R. A. Statistical Methods for Research Workers (Hafner Publishing Company, 1950).
10. Pepe, M. S. Receiver Operating Characteristic Methodology. J. Am. Stat. Assoc. 95 (449), 308–311 (2000).
11. Bamber, D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psychol. 12 (4), 387–415 (1975).
12. R Development Core Team. R: A Language and Environment for Statistical Computing, version 3.4.4 (R Foundation for Statistical Computing, Vienna, Austria, 2019).
13. Fisher, R. A. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), 179–188 (1936).
14. NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook
15. Richards-Kortum, R. & Sevick-Muraca, E. Quantitative optical spectroscopy for tissue diagnosis. Annu. Rev. Phys. Chem. 47, 555–606 (1996).
16. Gonzalez Viejo, C. et al. Development of a robotic pourer constructed with ubiquitous materials, open hardware and sensors to assess beer foam quality using computer vision and pattern recognition algorithms: RoboBEER. Food Res. Int. 89, 504–513 (2016).
17. Gonzalez Viejo, C., Fuentes, S., Torrico, D., Howell, K. & Dunshea, F. R. Assessment of beer quality based on foamability and chemical composition using computer vision algorithms, near infrared spectroscopy and machine learning algorithms. J. Sci. Food Agric. 98 (2), 618–627 (2018).
18. Kalousis, A., Prados, J. & Hilario, M. Stability of feature selection algorithms. In Fifth IEEE International Conference on Data Mining (ICDM'05), 8 pp. (IEEE, 2005).
19. Nogueira, S., Sechidis, K. & Brown, G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 18 (174), 1–54 (2018).
20. Bommert, A. & Lang, M. stabm: Stability measures for feature selection. J. Open Source Softw. 6 (59), 3010 (2021).

Additional Declarations

No competing interests reported.

Supplementary Files

SupplementaryMaterial.docx
Greco","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA30lEQVRIie3NvQqCUBjG8VeCXA65KkLewnsI+hjSsdsQDjgVbSE0ZAS1eAF2F07NB87gci6g0ZbmpnBMI2gJtS3o/Kdn+fEAqFQ/mMcBtAjcamv5rQ1BeBJW7Q5NviC82l2btCK2uF4S4EtDF9ye7l1wZjHWEq8XjGgKfHKMA3+w2DOgUtYTJDC0csgQzwTZXHaAJoHfQPR7SQR6JRFjuWlDyNBKS4ImoTsIBTgm401kZSXI0JQB0+IwI0hEE9FPVhy6aBxEVhS47juHbVRLXvA9CdZ/fMpp86FSqVR/1QOLMT7u5CjV5wAAAABJRU5ErkJggg==","orcid":"","institution":"VA Bedford Healthcare System","correspondingAuthor":true,"prefix":"","firstName":"Frank","middleName":"A.","lastName":"Greco","suffix":""},{"id":545492252,"identity":"eea1c7fc-f98e-4d5b-951b-42cfc47aba86","order_by":1,"name":"Eugene B. Hanlon","email":"","orcid":"","institution":"VA Bedford Healthcare System","correspondingAuthor":false,"prefix":"","firstName":"Eugene","middleName":"B.","lastName":"Hanlon","suffix":""}],"badges":[],"createdAt":"2025-11-04 15:08:18","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8030267/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8030267/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":96281194,"identity":"4d5056d4-4763-49e4-8409-889fb7493743","added_by":"auto","created_at":"2025-11-19 11:19:45","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":419839,"visible":true,"origin":"","legend":"","description":"","filename":"GrecoHanlonfractionproduct.docx","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/e3f7f8c7f1c19a2ccab05166.docx"},{"id":96281198,"identity":"7625855a-5a81-4c43-b45a-ed2aa1162930","added_by":"auto","created_at":"2025-11-19 
11:19:45","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":4581,"visible":true,"origin":"","legend":"","description":"","filename":"a0194331bc504daab221fb114c971c64.json","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/12f9ef361cadef71ce69c3ba.json"},{"id":96364647,"identity":"6b4d24e9-60f2-4057-8021-f5db214ebec7","added_by":"auto","created_at":"2025-11-20 10:09:30","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":72524,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterial.docx","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/4289f7707aa893f7a26ae9b4.docx"},{"id":96364644,"identity":"05066305-353b-45b4-a975-b746f444c1f1","added_by":"auto","created_at":"2025-11-20 10:09:30","extension":"xml","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":85712,"visible":true,"origin":"","legend":"","description":"","filename":"a0194331bc504daab221fb114c971c641enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/f764393a01c143af7fc406eb.xml"},{"id":96364439,"identity":"f3013a26-dd93-43a4-bfcb-50801593f162","added_by":"auto","created_at":"2025-11-20 10:09:18","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":19335,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/a1d6826336c9420e570aba50.png"},{"id":96281203,"identity":"77a9f028-7cc5-4315-8f95-ea2c57e81240","added_by":"auto","created_at":"2025-11-19 
11:19:45","extension":"png","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":172703,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/f81af42794bcb9cef9cc43ee.png"},{"id":96364548,"identity":"78e3e0d5-2e89-4b21-a930-0816583e5df4","added_by":"auto","created_at":"2025-11-20 10:09:24","extension":"png","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":10342,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/c4058570cc2438df5951da35.png"},{"id":96281204,"identity":"4d1a0bf5-06ee-4bb5-b779-20c7d5e4d823","added_by":"auto","created_at":"2025-11-19 11:19:45","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":15418,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/f5cec547e29c93ad1f574b03.png"},{"id":96281208,"identity":"53184302-393b-4584-b4f4-fcae55e4b647","added_by":"auto","created_at":"2025-11-19 11:19:45","extension":"png","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":19106,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/bf2baa651457d3f62a0143b7.png"},{"id":96281206,"identity":"1c7f00ef-0b47-43cc-b09a-3d2466ec08a7","added_by":"auto","created_at":"2025-11-19 
11:19:45","extension":"xml","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":85911,"visible":true,"origin":"","legend":"","description":"","filename":"a0194331bc504daab221fb114c971c641structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/d1dd6e1cd4674b82bcb9e76a.xml"},{"id":96364161,"identity":"387d7fb0-9144-4579-9b63-b75b1976c65d","added_by":"auto","created_at":"2025-11-20 10:08:59","extension":"html","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":93592,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/06e9f3b2909307aafe8642c4.html"},{"id":96281192,"identity":"9f38e28f-636e-4b4d-a7d7-6797f3f0176f","added_by":"auto","created_at":"2025-11-19 11:19:45","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":45589,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eGeneral properties of fraction-product.\u003c/strong\u003eThe curves are two population distributions from which individuals are randomly drawn. Panel A illustrates the effective separation of the two classes by Attribute X1 (fraction-product value near 1). Panel B shows that Attribute X2 has no ability to separate the classes (fraction-product 0.25). The vertical bars mark the medians of the two classes within the sample; the caret indicates the cut-off calculated as step 3 in the algorithm. 
Abbreviations: “f1”=fraction of Class 1 correctly classified; “f2”=fraction of Class 2 correctly classified; “fp”=fraction-product=f1×f2.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/266e29d838722e57ea74d52a.png"},{"id":96363489,"identity":"3c565c47-ea07-440e-a4d8-02ee77c2a02a","added_by":"auto","created_at":"2025-11-20 10:07:08","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":159990,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDistribution of fp-values determined by simulation for the null hypothesis for a particular attribute.\u003c/strong\u003e Both simulations assume a normal distribution of attribute values and set mean=0, sd=1 for each of the two classes. In panel A, 100 data points were generated for each class; in panel B, 15 data points were generated for each class. One million iterations were performed. In panel A, although not apparent at the scale used, there are 180 points to the right of fp=0.4; therefore, the probability that fp ≥ 0.4 occurs by chance is 0.00018. In Panel B, there are 139,163 points with fp ≥ 0.4, which leads to an estimated p-value of 0.139 that an fp-value ≥ 0.4 would happen by chance. 
The asterisks mark the cutoff beyond which an fp-value is extreme by the p\u0026lt; 0.001 criterion.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/3dbb2c10988b9ef1893cda56.png"},{"id":96364155,"identity":"2a4bb8c1-6b02-4fe1-96f6-b4a460362a1d","added_by":"auto","created_at":"2025-11-20 10:08:59","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":29937,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHistograms of the distributions of fp, f1 and f2 from the simulations for the null hypothesis with 141 subjects per class.\u003c/strong\u003e For the null hypothesis, f1 and f2 are interchangeable and their distributions are identical. The asterisks mark the cutoff at the p\u0026lt;0.001 level for a value to be extreme.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/275858ccf0c8874f23fd06e5.png"},{"id":96281200,"identity":"96aa1442-961d-4592-bce0-010eb3fb25f6","added_by":"auto","created_at":"2025-11-19 11:19:45","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":41106,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHistogram of fp-values on training set, 141 subjects in each group.\u003c/strong\u003e The asterisk marks the cut-off (0.36) above which an fp-value would be considered extreme with respect to the distribution of the null hypothesis (see Figure 3); fp-values below 0.359 would be ruled-out as likely due to chance. Values of fp \u0026gt; 0.359 are not likely to be due to chance but may not be useful. 
A well-chosen set of discriminant attributes should have most attributes be extreme with respect to the distribution from the null hypothesis.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/d4e8c90b56b28063f26ddb72.png"},{"id":96281197,"identity":"9f7f3494-394f-4db5-9087-9cf40aa3aed3","added_by":"auto","created_at":"2025-11-19 11:19:45","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":54571,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHistograms of Pearson’s r for each the features maxfp.feat, maxf1.feat, maxf2.feat with all attributes.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/a273ea9533ce660b6c98c3e1.png"},{"id":97141215,"identity":"27217e53-748b-43ff-be35-46a954e5de1d","added_by":"auto","created_at":"2025-12-01 10:06:26","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1897305,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/9803f188-5b2e-47fc-be94-2448293bc1f0.pdf"},{"id":96281202,"identity":"bd3cfe53-d0c2-4110-abf1-b054a523fb9a","added_by":"auto","created_at":"2025-11-19 11:19:45","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":72524,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterial.docx","url":"https://assets-eu.researchsquare.com/files/rs-8030267/v1/5fbe9c8e08cabdea4ad51903.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"The Fraction-product: A Novel Discriminant Statistic for Binary Classification","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThis work grew out of attempts to use near-infrared reflectance 
spectroscopy \u003cem\u003ein vivo\u003c/em\u003e to classify two groups of subjects: Alzheimer\u0026rsquo;s disease and elderly controls[\u003cspan additionalcitationids=\"CR2 CR3 CR4 CR5\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. One exploratory approach in spectroscopy displays the spectra as intensity against wavelength using colors to indicate each class and allows one to look for regions in which the colors separate. We sought to automate this process for intensity at each detector pixel (wavelength) as follows:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eFor class 1, determine the median value (M1) of the data.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eFor class 2, determine the median value (M2) of the data.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eEstimate a classification cutoff as (M1\u0026thinsp;+\u0026thinsp;M2)/2.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eDetermine the fraction (f1) of class 1 data points on the same side of the cutoff as M1.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eDetermine the fraction (f2) of class 2 data points on the same side of the cutoff as M2.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eIt appeared that the arithmetic product of those fractions, f1\u0026times;f2, could serve as a measure of how well detected light at that wavelength discriminated between the two classes. We called this discriminant statistic the \u0026ldquo;fraction-product\u0026rdquo; (abbreviated as \u0026ldquo;fp\u0026rdquo;) and calculated it as the 6th step in the above algorithm. 
We refer to the intuition associated with this discriminant measure as \u0026ldquo;taxonomic information.\u0026rdquo;\u003c/p\u003e\u003cp\u003eThe fraction-product facilitated the selection of those wavelengths with the greatest efficacy in classifying the subjects in our study[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], and we expect that supervised feature selection may be its most useful application. After selecting the features, a classification algorithm combines them, often weighting them in the process. Therefore, the rough calculation of the cut-off in step 3 for each feature should not degrade the final accuracy of their combination. Here we explore the general nature of the fraction-product and illustrate how it may usefully analyze a publicly available, non-spectroscopy dataset: the Diagnostic Wisconsin Breast Cancer Database[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe purpose of this paper is to communicate the fraction-product as a simple idea and to illustrate how it may be applied practically. After a presentation of the mathematical basis of the concepts, we proceed directly to the application. Typically, the primary concern of feature selection is the ability of the classification algorithm to optimize some measure of performance, that is, to find the best solution to the classification problem. Common secondary aims are to keep the features as uncorrelated as possible and their number as small as possible. 
These additional goals lead to a new measure that we will call \u0026ldquo;taxonomic potential.\u0026rdquo; Finally, we will discuss some applications that are useful in spectroscopy but are not likely to generalize to other fields.\u003c/p\u003e"},{"header":"Materials and methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003eTheory\u003c/h2\u003e\n \u003cp\u003eAs the fields of machine learning and modern data analysis add their unique literatures to those of the more venerable statistics and theory of errors of observations, the terminology common to all four fields may take slightly different connotations. Here we use the term \u0026ldquo;statistic\u0026rdquo; in its more recent sense of \u0026ldquo;a numerical characteristic of a sample\u0026rdquo;[\u003cspan class=\"CitationRef\"\u003e8\u003c/span\u003e] without Fisher\u0026rsquo;s earlier implication that its purpose is to estimate a population parameter[\u003cspan class=\"CitationRef\"\u003e9\u003c/span\u003e]. The words \u0026ldquo;feature\u0026rdquo; and \u0026ldquo;attribute\u0026rdquo; have been used interchangeably. We will follow a common practice of giving \u0026ldquo;feature\u0026rdquo; the connotation of \u0026ldquo;a special attribute.\u0026rdquo; Since the values of all attributes except \u0026ldquo;class/diagnosis\u0026rdquo; are numeric, they are \u0026ldquo;variates.\u0026rdquo; We shall use the term \u0026ldquo;information\u0026rdquo; in its everyday, nontechnical sense. As a shorthand, we shall refer to a datum falling on the same side of the cut-off as its group median as being \u0026ldquo;correctly allotted\u0026rdquo; or \u0026ldquo;correctly classified\u0026rdquo; by that attribute.\u003c/p\u003e\n \u003cp\u003eClassification studies begin with the random selection of individuals from defined populations, which introduces sampling error, and the examination of various attributes that characterize those individuals. 
In our case, the attributes are numeric and can order or rank the data points. Figure \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e illustrates this process for two attributes (X1 and X2) as well as the application of the above algorithm to the sample.\u003c/p\u003e\n \u003cp\u003eFor a given data sample and attribute, f1 is the probability that a data point randomly selected from group 1 in the sample will be correctly allotted and similarly for f2. It follows that if two data points be drawn from the sample, one randomly from group 1 and the other randomly from group 2, then the fraction-product is the probability that both will be correctly allotted. Thus, the fraction-product measures the amount of \u0026ldquo;taxonomic information\u0026rdquo; in that attribute, which is the desideratum.\u003c/p\u003e\n \u003cp\u003eIn order to compare the relative taxonomic efficacy of two attributes in the same sample, the fp-value of each suffices. When the classification algorithm weights and combines features, it is useful to include attributes with the maximum values of f1 and f2. Intuitively, f1 and f2 also contain taxonomic information, although of a slightly different nature from that in the fraction-product; f1 and f2 assess the effect of the cut-off on each subset separately, whereas f1\u0026times;f2 assesses the effect on the whole sample.\u003c/p\u003e\n \u003cp\u003eThe association of the fraction-product with a probability based on drawing one subject from each group in the sample clarifies its relationship to other similar measures. 
For example, the area under the curve of the receiver operating characteristic is also a discriminant statistic that measures a probability concerning two subjects drawn randomly, one from each of the two groups; however, it corresponds to the probability that the two points will be in the same order as their group medians[\u003cspan class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e11\u003c/span\u003e]. Being \u0026ldquo;correctly classified\u0026rdquo; is a conceptually more appealing taxonomic measure than being \u0026ldquo;correctly ordered.\u0026rdquo; Likewise, accuracy, the percentage of subjects correctly classified, is similar to the fraction-product. However, accuracy is the probability that one subject drawn randomly from the entire sample will be correctly allotted, which corresponds to a probabilistic process different from the fraction-product and which is affected by prevalence. A simple example described in \u0026sect;\u0026nbsp;1 of the Supplementary Materials further clarifies this point and shows the influence of prevalence.\u003c/p\u003e\n \u003cp\u003eAs a statistic, the fraction-product is non-parametric according to the common use of that term. Furthermore, its utility does not depend on the distribution of the population from which the sample is drawn.\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eGeneral properties of the fraction-product\u003c/h3\u003e\n\u003cp\u003eThe range of the fraction-product is between 0.25 (no taxonomic information) and 1.0 (complete separation of the two groups). It is always positive, which serves well to quantify the amount of taxonomic information in the attribute. To expand on this point, if we fix the difference between M1 and M2 as M1─M2. If M1\u0026thinsp;\u0026lt;\u0026thinsp;M2, the sign of the difference becomes negative, but f1 and f2 remain positive numbers. 
Therefore, the minimum possible value of f1 and f2 is 0.5, and the minimum value of their product is 0.25.\u003c/p\u003e\n\u003cp\u003eAs the product of ratios of two natural numbers, the fraction-product has no units and may be viewed as a pure number. However, any fraction-product is tightly linked to the attribute and the two classes that generated it.\u003c/p\u003e\n\u003cp\u003eIn medical applications, the number of subjects may be small, and the fraction-product can take on only a few values. As an extreme example, if each group have 5 subjects, then f1 and f2 can assume only the values 0.6, 0.8 and 1.0. Therefore, the fraction-product can take on one of six values: 0.36, 0.48, 0.6, 0.64, 0.8, 1.0; the step between 0.6 and 0.64 is much smaller than the other steps, which can lead to unexpected clustering of data points. As this example makes clear, discreteness will be much more apparent if f1 and f2 are used as statistics than it will be for the fraction-product. Furthermore, the minimum value of the fraction-product in this extreme example is 0.36, not 0.25. The approach described here can be tailored to apply to small N, say 10 or 15 subjects in each group, but in what follows we will assume that N is large enough that discreteness, although always present, has no significant influence.\u003c/p\u003e\n\u003ch3\u003eDistribution of the null hypothesis for an attribute\u003c/h3\u003e\n\u003cp\u003eIt is necessary to get a sense of how likely a particular fp-value is to occur completely by chance, i.e., by random sampling. The question here is whether a given attribute reveals the two populations. 
The fraction-product allows the null hypothesis for a particular attribute \u0026ndash; that the distributions of Class 1 and Class 2 are identical \u0026ndash; to be reworded as fp\u0026thinsp;=\u0026thinsp;0.25 for that attribute.\u003c/p\u003e\n\u003cp\u003eGiven the number of subjects in each group of a sample, we are not aware of any method to calculate the distribution of fp-values, even if the population distributions are known or assumed. Therefore, we have approached this issue through simulations, usually assuming normal distributions for both classes in the simulation. Typical results are shown in Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e. Examination of the scales of the two x-axes leads to the most apparent conclusion: the smaller the number of subjects in each class, the broader the spread of the fp-values from the true value of 0.25. As expected, the distributions are not Gaussian.\u003c/p\u003e\n\u003cp\u003eAlthough p-values are usually not used in feature selection, they are necessary for evaluating extreme fp-values because the hard lower limit of 0.25 precludes the use of a measure like confidence limits. For normally distributed variates, an extreme value may be defined as being outside the 3SD limits of the distribution. This is equivalent to the requirement that the datum must have a probability (p-value)\u0026thinsp;\u0026lt;\u0026thinsp;0.003 of coming from the distribution. We usually use p\u0026thinsp;\u0026lt;\u0026thinsp;0.001 as the criterion for an extreme value. Simulations demonstrate that with 100 subjects in each class, an attribute with an fp-value\u0026thinsp;\u0026ge;\u0026thinsp;0.38 would be extreme by this criterion. With 15 subjects in each class, an attribute with an fp-value\u0026thinsp;\u0026ge;\u0026thinsp;0.64 would be extreme. These values comport with the distributions shown in Fig. 
\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eSimilarly, as shown in the caption of Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, the p-value associated with fp\u0026thinsp;\u0026ge;\u0026thinsp;0.4 in panel A is 0.00018. In Panel B, the p-value associated with fp\u0026thinsp;\u0026ge;\u0026thinsp;0.4 is 0.139. Therefore, if there are 100 individuals in each class in the sample, an attribute with an fp-value of 0.4 may not have significant taxonomic information, but it most likely did not occur by chance. If there are only 15 individuals in each class in the sample, then an attribute with an fp-value of 0.4 would not be attractive; it not only has little taxonomic information but also has a high probability of being due to a random fluctuation from sampling.\u003c/p\u003e\n\u003cp\u003eIn practice, a good discriminant feature will have an fp-value that is extreme with respect to the distribution from the null hypothesis for an attribute. This is crucial for studies with relatively small numbers of subjects. This is also the reason we have used the term \u0026ldquo;extreme values\u0026rdquo; rather than \u0026ldquo;outlier,\u0026rdquo; which connotes something to be excluded rather than sought. The worked application below will further clarify these trade-offs (Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e\n\u003ch3\u003eAnalysis of the Diagnostic Wisconsin Breast Cancer Database\u003c/h3\u003e\n\u003cp\u003eThe Diagnostic Wisconsin Breast Cancer Database[\u003cspan class=\"CitationRef\"\u003e7\u003c/span\u003e] consists of thirty numerical attributes measured on the cell nuclei of breast tissue obtained by fine needle aspiration. The irregularity of the nuclei is an important diagnostic factor, and this was assessed by fitting a spline curve to the apparent boundary of each cell\u0026rsquo;s nucleus. 
Various geometric properties of the curve \u0026ndash; such as area, perimeter, fractal dimension, etc. \u0026ndash; were determined on the population of nuclei on a single slide. Given that malignant cells may be only a small portion of those studied on each slide, the \u0026ldquo;worst\u0026rdquo; values and standard errors (designated as \u0026ldquo;_se\u0026rdquo; in name) were used as attributes in addition to the mean. Therefore, ten geometric properties became thirty attributes assigned to each slide for analysis. Some attributes, like area and perimeter, are expected to be correlated. The classification was binary: benign or malignant. The dataset contained 212 malignant specimens (Class 1) and 357 benign specimens (Class 2).\u003c/p\u003e\n\u003ch3\u003eGeneral analytic approach to feature selection\u003c/h3\u003e\n\u003cp\u003eAll statistical computations were done using R[\u003cspan class=\"CitationRef\"\u003e12\u003c/span\u003e]. Supplementary Material \u0026sect;\u0026nbsp;2 is the .Rhistory file of this analysis, annotated for those not familiar with R. Here we describe the procedure in general terminology. Linear discriminant analysis[\u003cspan class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e14\u003c/span\u003e] will be used as the diagnostic algorithm after feature selection.\u003c/p\u003e\n\u003cp\u003eAfter downloading the data set from the online website, preliminary steps are to load the .csv file into R and to remove the patient identifier number. The first column is now the diagnosis: malignant (M) or benign (B). Next, random samples of 141 slides classified as benign and 141 slides classified as malignant are drawn. For reproducibility, a seed is set before taking each random sample. When combined, the two subsets form the training set, which is about 42% of the whole data set. 
Although this is less than that commonly used for training, the reason for this choice will become clear when the variability of features selected is studied later (see \u003cstrong\u003eThe stability of the features selected\u003c/strong\u003e below).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e1. Perform the simulation to determine the distribution of the null hypothesis for the training set with 141 subjects in each group.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAs above, the simulation assumed two normally distributed variates with mean\u0026thinsp;=\u0026thinsp;0 and sd\u0026thinsp;=\u0026thinsp;1. Because we will use linear discriminant analysis to weight the features selected for the diagnostic algorithm, the distribution of f1, which is identical to f2 for the null hypothesis, and the fraction-product are displayed in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e. The criterion for extreme values of the fraction-product at the p\u0026thinsp;\u0026lt;\u0026thinsp;0.001 level is fp\u0026thinsp;\u0026gt;\u0026thinsp;0.359 and that for f1, f2 is f1, f2\u0026thinsp;\u0026gt;\u0026thinsp;0.605.\u003c/p\u003e\n\u003cp\u003e2.\u0026nbsp;\u003cstrong\u003eFor the two groups in the training set, determine values of fp, f1, f2, and M1\u0026minus;M2 for all 30 attributes. The malignant-subset is Class 1, and the benign-subset is Class 2.\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv class=\"BlockQuote\"\u003e\n \u003cp\u003eWe call the array that stores these values the \u0026ldquo;fraction-product data frame.\u0026rdquo; Because linear discriminant analysis assigns weights to the features, fp, f1 and f2 will be used as statistics and their distributions over all 30 attributes are shown in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cstrong\u003e3. 
Remove attributes with no information.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFive attributes had fp\u0026thinsp;\u0026lt;\u0026thinsp;0.359 and were removed: fractal_dimension_mean, fractal_dimension_se, texture_se, smoothness_se and symmetry_se. All subsequent operations were performed on this second fraction-product data frame with 25 attributes. The fp-value alone determines whether an attribute is removed rather than f1 or f2. The reason is that an attribute may, for example, have f1\u0026thinsp;=\u0026thinsp;0.5 and f2\u0026thinsp;=\u0026thinsp;0.9 which entail fp\u0026thinsp;=\u0026thinsp;0.45; it would be premature to remove it because of f1\u0026thinsp;\u0026lt;\u0026thinsp;0.605 alone when both fp and f2 suggest retaining.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4. Determine the attributes with maximum values of fp, f1 and f2 and select as features.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn this analysis, only one attribute attained the maximum value for each statistic; the first three features are radius_worst, concave.points_worst and area_worst respectively. To keep the presentation general, we will refer to these features as maxfp.feat, maxf1.feat and maxf2.feat instead of their particular names.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e5. Determine the correlation coefficients of the features maxfp.feat, maxf1.feat and maxf2.feat with all attributes.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHistograms for the three features are shown in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e. The histogram for feature maxf1.feat is remarkable, showing a clear separation of attributes into two groups: highly correlated with f1 and uncorrelated with f1. We will return to this fact later to devise a method that takes advantage of it. Here we proceed with a more systematic and general approach.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e6. 
Determine the taxonomic potential for each entry in the 3\u0026times;25 array of correlation coefficients.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFor any two attributes A and B, 1- \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{r}}_{\\text{A}\\text{B}}^{2}\\:\\)\u003c/span\u003e\u003c/span\u003e is a rough measure of the variation of A and B that cannot be explained by their linear relationship. Denoting the fraction-product of attribute A as fp\u003csub\u003eA\u003c/sub\u003e, the expression fp\u003csub\u003eA\u003c/sub\u003e\u0026times;(1- \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{r}}_{\\text{A}\\text{B}}^{2}\\)\u003c/span\u003e\u003c/span\u003e) is a measure of the \u003cem\u003epotential\u003c/em\u003e for the taxonomic information in A (fp\u003csub\u003eA\u003c/sub\u003e) to not be explained by that in B. Similarly, the expression fp\u003csub\u003eB\u003c/sub\u003e\u0026times;(1- \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{r}}_{\\text{A}\\text{B}}^{2}\\)\u003c/span\u003e\u003c/span\u003e) is a measure of the \u003cem\u003epotential\u003c/em\u003e for the taxonomic information in B to not be explained by that in A. Two points deserve emphasis. First, if there is little taxonomic information in A, there will be little potential regardless of how much of the variation in A cannot be explained by its correlation with B. Second, if most of the variation in A can be explained by B, there will be little potential regardless of how much taxonomic information there is. Clearly, if A\u0026thinsp;=\u0026thinsp;B, the potential is 0.\u003c/p\u003e\n\u003cp\u003eSince we are using fp, f1 and f2, there are three measures of taxonomic information. 
Therefore, for any two attributes, we define the taxonomic potential of attribute A with respect to attribute B to be\u003c/p\u003e\n\u003cp\u003epot.B(A)\u0026thinsp;=\u0026thinsp;tax(A)(1- \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{r}}_{\\text{A}\\text{B}}^{2}\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\n\u003cp\u003ewhere tax(A) is a measure of the taxonomic information in A and could be fp, f1 or f2. Letting X be a variable whose domain is the set of 25 attributes, we compute\u003c/p\u003e\n\u003cp\u003epot.maxfp.feat(X)\u0026thinsp;=\u0026thinsp;fp\u003csub\u003eX\u003c/sub\u003e (1- r(X,maxfp.feat)\u003csup\u003e2\u003c/sup\u003e)\u003c/p\u003e\n\u003cp\u003epot.maxf1.feat(X)\u0026thinsp;=\u0026thinsp;f1\u003csub\u003eX\u003c/sub\u003e(1- r(X,maxf1.feat)\u003csup\u003e2\u003c/sup\u003e)\u003c/p\u003e\n\u003cp\u003epot.maxf2.feat(X)\u0026thinsp;=\u0026thinsp;f2\u003csub\u003eX\u003c/sub\u003e(1- r(X,maxf2.feat)\u003csup\u003e2\u003c/sup\u003e)\u003c/p\u003e\n\u003cp\u003eWe use the function input notation for Pearson\u0026rsquo;s r(attribute 1, attribute 2) instead of the commonly used subscript notation for single letter variables (r\u003csub\u003exy\u003c/sub\u003e) because the names here are simply too long. The features with the maximum values of the three taxonomic potentials are respectively smoothness_worst, texture_mean, and smoothness_worst, which adds two additional features to the initial three.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e7. Perform linear discriminant analysis (LDA) on the training and test sets using five features: radius_worst, concave.points_worst, area_worst, smoothness_worst and texture_mean.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe weights (scalings) are determined in the training set. The summary results are obtained by treating the data of each set as unknowns and comparing the predicted class with the diagnosis. 
The Diagnostic Wisconsin Breast Cancer Database website[\u003cspan class=\"CitationRef\"\u003e7\u003c/span\u003e] lists accuracy and precision as performance measures for comparison of various methods. In the diagnosis of cancer, much importance is placed on not missing a malignancy; for this reason, we also compute recall as a measure of performance. The results are entered as LDA 1 in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e. We note that once the features are selected, they are weighted by LDA according to their ability to separate the two classes, regardless of the reason why each one was selected. These weights are given in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e. Furthermore, the customary word \u0026ldquo;training\u0026rdquo; is not accurate for linear discriminant analysis because it is determinative on whatever subset it is applied to; it is not trained. Therefore, it is useful to compare the performance of the training and test sets, especially because the prevalence changes from 50% to 24.7%.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e8. Because of the natural division of attributes by maxf1.feat into correlated and uncorrelated groups (\u003c/strong\u003eFig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e\u003cstrong\u003e), we consider an alternative to the taxonomic potential that was calculated systematically above.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAny attribute with Pearson\u0026rsquo;s r\u0026thinsp;\u0026lt;\u0026thinsp;0.65 may be considered as uncorrelated with maxf1.feat. Then we can simply ask: among the attributes uncorrelated with maxf1.feat, which has the maximum value of f1? Texture_worst is the unique answer. We then perform LDA using four features: radius_worst, concave.points_worst, area_worst, and texture_worst. 
The summary results are also given in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e (LDA 2) and include the encouraging recall-value of 1 for the test set. The weights are listed in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n \u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eSummary Results of Linear Discriminant Analysis\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNumber of Features\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAccuracy CI\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ePrecision\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLDA 1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003etraining\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.957\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.927\u0026ndash;0.978\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.96\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLDA 1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003etest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.976\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.950\u0026ndash;0.990\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLDA 2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003etraining\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.958\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.927\u0026ndash;0.978\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLDA 2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003etest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.979\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.955\u0026ndash;0.992\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.92\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n 
\u003c/table\u003e\n\u003c/div\u003e\n\u003cdiv class=\"gridtable\"\u003e\n \u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n \u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n \u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eWeights assigned to features\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eWhy selected\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eLDA 1\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eLDA 2\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003econcave.points_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003emaxf1.feat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e8.4024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e12.3059\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003earea_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003emaxf2.feat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.0037\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.0033\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eradius_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003emaxfp.feat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.6268\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"char\"\u003e\n \u003cp\u003e0.5545\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003esmoothness_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003euncorrelated with max(f2\u0026amp;fp).feat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e13.8635\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003etexture_mean\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003euncorrelated with maxf1.feat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.1083\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003etexture_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003euncorrelated with maxf1.feat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.0761\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003eThe stability of the features selected\u003c/h2\u003e\n \u003cp\u003eThe annotated .Rhistory file for this analysis is included in the Supplementary Material \u0026sect;\u0026nbsp;3. There are 212 malignant specimens and 357 benign specimens. This allows us to divide the malignant specimens into three subsets of 71, 71 and 70 slides and the benign specimens into five subsets of 71, 71, 71, 72, and 72 slides. By choosing two subsets from the malignant set and two subsets from the benign set, we can generate 30 different training sets of roughly 141 benign and 141 malignant specimens. 
In this way, we can vary the composition of the benign and malignant training subsets while keeping their numbers nearly identical to those of our first analysis. The anticipation of this analysis determined the choice of 141 elements in the training subsets used in feature selection above.\u003c/p\u003e\n \u003cp\u003eWe limit the study of stability to the features named maxfp.feat, maxf1.feat, and maxf2.feat. The results are summarized in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e. For the fraction-product, only one feature attained the maximum value for each analysis; however, over the 30 combinations there were four features selected as maxfp.feat with the distribution indicated in the table. All 30 combinations returned concave.points_worst as maxf1.feat. For maxf2.feat, all 30 combinations included area_worst as the feature. It was the unique feature for 23 out of the 30 combinations; for 6 there was one other feature that attained the maximum value of f2, and for one there were three features whose f2-values were the maximum.\u003c/p\u003e\n \u003cp\u003eTherefore, the probability that concave.points_worst will be chosen as maxf1.feat approaches 1. The probability that area_worst will be chosen as maxf2.feat is minimally 23/30\u0026thinsp;=\u0026thinsp;0.77 and could approach 1 depending upon what rules are implemented when two or three attributes tie for the maximum f2-value. 
These decisions have nothing to do with the fraction-product \u003cem\u003eper se\u003c/em\u003e and are the reason we limit the study of stability to these three features.\u003c/p\u003e\n \u003cp\u003eThe Jaccard indices determined for the 30 combinations are as follows: maxfp.feat \u0026ndash; 0.33; maxf1.feat \u0026ndash; 1.0; maxf2.feat \u0026ndash; 0.79.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eNumber of times feature selected in 30 combinations\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003emaxfp.feat\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003emaxf1.feat\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003emaxf2.feat\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003etotal\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eConcave.points_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e43\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eArea_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePerimeter_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eConcave.points_mean\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRadius_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eArea_se\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003eTotal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eJaccard index\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e contains the correlation matrix for these six features and indicates that they are all correlated. Those features with correlation coefficients above 0.9 may be viewed as roughly equivalent. Within this perspective, the features fall into two groups. One includes concave.points_worst and concave.points_mean. The other includes area_worst, perimeter_worst, and radius_worst. These two groups correspond to maxf1.feat and maxf2.feat. 
From this broader viewpoint, the features are extremely stable.\u0026nbsp;\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab4\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eCorrelation matrix of all features selected from 30 combinations\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eccav.pt_wst\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ear_wst\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003epr_wst\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eccav.pt_mn\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003erd_wst\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ear_se\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003econcave.points_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.0000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7474\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8163\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9101\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7874\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.5381\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003earea_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7474\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"left\"\u003e\n \u003cp\u003e1.0000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9775\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8096\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9840\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8114\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eperimeter_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8163\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9775\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.0000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8559\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9937\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7612\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003econcave.points_mean\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9101\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8096\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8559\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.0000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8303\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.6902\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003eradius_worst\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7874\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9840\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.9937\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8303\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.0000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7573\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003earea_se\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.5381\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8114\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7612\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.6902\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.7573\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.0000\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"7\" align=\"left\"\u003e\n \u003cp\u003eNote: Row and column names are identical; for display the column names are shortened\u003c/p\u003e\n \u003cp\u003eKey: \u003cstrong\u003ecorrelates with maxf1.feat\u003c/strong\u003e; \u003cstrong\u003ecorrelates with maxf2.feat.\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003ch3\u003eApplications specific to spectroscopy\u003c/h3\u003e\n\u003cp\u003eApplications of spectroscopy to classification problems historically involve chemometric analysis in which measured spectra are compared to those from known 
materials in order to characterize the unknown materials chemically[\u003cspan class=\"CitationRef\"\u003e15\u003c/span\u003e]. More recently, computer vision analyses have facilitated the use of machine learning techniques[\u003cspan class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e17\u003c/span\u003e]. Our approach examines spectral features without regard to standard materials; it seeks simply to distinguish the two classes by comparing the shapes of the spectra[\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e]. We found that area-normalized first derivatives of the intensity spectra performed best[\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eMost modern spectrometers disperse the incoming light onto a digital detector, which associates the intensity measured at each pixel or bin with a wavelength. In our case, the detector had 1024 pixels, and we viewed each spectrum as a collection of 1024 attributes to be used for classification. Feature selection started with calculating the fraction-product at each pixel on a training set consisting of diseased and non-diseased subjects. However, optical features have linewidth (extent over wavelength), which added another criterion for feature selection: the fp-values must indicate significant discrimination on at least three contiguous pixels[\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eConsideration of how to measure taxonomic information over several contiguous pixels led to the notion of taxonomic signal, that is, fp/0.25[\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e]. Intuitively, this number measures the amount of taxonomic information reported by the attribute using no information as a reference state. 
It ranges between 1 and 4. When computed on several contiguous pixels, the product of these values depends on the value at each pixel but never decreases as pixels are added, because each factor is at least 1; it thereby serves as a measure of how well the entire region discriminates between the two classes[\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e]. In contrast, the product of unscaled fp-values never increases with the number of pixels, because each factor is at most 1, which makes it a counterintuitive measure at best. We discuss these concepts in more detail elsewhere[\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e]. Although they are extremely effective for optical spectra, they are unlikely to be useful in most other analyses.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eDiagnostic Wisconsin Breast Cancer Database\u003c/h2\u003e\u003cp\u003eThe webpage of the Diagnostic Wisconsin Breast Cancer Database lists five techniques that have been applied to the data: xgboost, support vector, random forest, neural network, and logistic regression. Reported accuracies range from 92.3% to 97.9%, and our overall test-set accuracy of 97.7% is comparable to the best performance. Precisions vary between 91.6% and 97.9%, with ours being about 93%.\u003c/p\u003e\u003cp\u003eThree points deserve attention. First, our training set was 42% of the whole dataset and our test set was 58%, the opposite of the design of most studies. Although this was chosen to allow the study of feature stability, in our experience the efficacy remains high even for smaller training sets, which is the best that can be said at this stage of development. We attribute this in part to matching the number of malignant and benign slides in the training set, but this is likely to vary with the data set. 
Second, the prevalence of malignancy in the whole data set was 37%; our test set had a lower prevalence (24.7%), yet our method performed as well on the test set as on the training set, which had a prevalence of 50%. This fact suggests that our method\u0026rsquo;s performance is relatively independent of prevalence and that the features approach our goal of being pathognomonic. Third, our purpose is simply to illustrate the approach, not to contribute to the diagnosis of breast cancer.\u003c/p\u003e\u003cp\u003eThe motivation behind keeping the number of features to a minimum is partly esthetic but primarily practical. The smaller the number of features, the easier it is to note relationships with other variables, for example, between a feature and clinical course. Furthermore, any method put into clinical practice must have quality assurance procedures. A small number of features not only facilitates the development of quality assurance procedures but also simplifies troubleshooting when measurements are out of control. Four or five variables may be studied systematically to isolate problems with research methods, data analysis or manufacturing techniques; thirty variables cannot.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eStability of feature selection\u003c/h2\u003e\u003cp\u003eThe practical issues arising from devising quality assurance procedures motivated our study of feature stability. Kalousis and colleagues were among the first to address the issue of the stability of feature selection algorithms[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], and their seminal work remains an insightful approach to the problem. 
Nogueira et al.[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] have reviewed and consolidated the numerous measures that have been suggested, and Bommert and Lang[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] have created an open source software package that implements the commonly used methods, which we used to compute the Jaccard index for the first three features selected. By the Jaccard index, maxf1.feat (1.0) and maxf2.feat (0.79) are considered stable, whereas maxfp.feat (0.33) is not stable. For our purposes, we note that there is a significant difference between claiming that four features are sufficient to classify nuclei and choosing four features for quality assurance. The former requires a rigorous separation of training and test sets; there is no reason why the latter could not be done on the entire data set after the fact. The choice of concave.points_worst for maxf1.feat and area_worst for maxf2.feat could be made by examination without any computation because they attain the maximal values for those two features in all of the variations of the training sets.\u003c/p\u003e\u003cp\u003eIn both applications of LDA, maxf1.feat (concave.points_worst) was heavily weighted and played a significant role in classifying slides. This is not surprising because malignant nuclei are highly irregular; the nuclei with the largest number of points with concave curvature clearly distinguish them from normal cells. Because there were fewer malignant cases, our method of combining the subsets of malignant slides (3 subsets) achieved less variation among the malignant cases as input than among the benign cases (5 subsets). 
Therefore, the stability of maxf1.feat might be overly optimistic because of the lesser variation of input.\u003c/p\u003e\u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThe fraction-product is a novel discriminant statistic based on a simple concept, which allows it to be adapted easily to different datasets. More importantly, its simplicity facilitates the development of additional concepts that enhance its utility: optical taxonomic signal for spectroscopic data and taxonomic potential for correlated attributes.\u003c/p\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003eData availability\u003c/h2\u003e\u003cp\u003eThe data are available from Breast Cancer Wisconsin (Diagnostic).\u003c/p\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data are available from Breast Cancer Wisconsin (Diagnostic).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors thank Dr. J. C. Huetter for general discussions concerning the fraction-product and Drs. Linsey McColl (MD, Select Statistical Services, Ltd), Dane Netherton and Brent Schell for critically reading the manuscript. Although this work was supported in part by the US Department of Veterans Affairs, all opinions are the authors\u0026rsquo;.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported in part by an unrestricted grant from Headwall Photonics, Inc (Frank A. Greco), by Merit Review Award I01 CX000827 from the U.S. Department of Veterans Affairs, Clinical Sciences Research and Development Service (Eugene B. Hanlon) and generally by the U.S. 
Department of Veterans Affairs.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor information\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors and Affiliations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eResearch Service (151B), Edith Nourse Rogers Memorial Veterans Hospital, Bedford, MA 01730 USA\u003c/p\u003e\n\u003cp\u003eFrank A. Greco and Eugene B. Hanlon\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eContributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBoth authors participated in the design of this work and in the writing of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCorresponding author\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCorrespondence to Frank A. Greco: [email protected].\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eHanlon, E. B. et al. Scattering differentiates Alzheimer disease in vitro. \u003cem\u003eOpt. Lett.\u003c/em\u003e \u003cb\u003e33\u003c/b\u003e (6), 624\u0026ndash;626 (2008).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGreco, F. A., McKee, A. C., Kowall, N. W. \u0026amp; Hanlon, E. B. Near-Infrared Optical Spectroscopy In Vivo Distinguishes Subjects with Alzheimer's Disease from Age-Matched Controls. \u003cem\u003eJ. Alzheimers Dis.\u003c/em\u003e \u003cb\u003e82\u003c/b\u003e (2), 791\u0026ndash;802 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGreco, F. A., Hanlon, E. B. 
Evaluation of brain tissue and material based on a fraction-product and optical spectroscopy. \u003cem\u003eUS Patent Trademark Office\u003c/em\u003e US-20230175955-A1, 1\u0026ndash;46 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGreco, F. A. Optical taxonomic signal. \u003cem\u003earXiv [physics.med-ph]\u003c/em\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGreco, F. A., Schell, B. R. \u0026amp; Hanlon, E. B. Optical Taxonomic Signal and the Diagnosis of Alzheimer's Disease. \u003cem\u003eIEEE Open J. Eng. Med. Biol.\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e, 107\u0026ndash;112 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHanlon, E. B. \u0026amp; Greco, F. A. Spectroscopic detection of brain damage. US Patent 10,405,751 B2 (The United States of America as represented by the Department of Veterans Affairs, 2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWolberg, W., Mangasarian, O., Street, N. \u0026amp; Street, W. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository (1993).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEveritt, B. S. \u0026amp; Skrondal, A. \u003cem\u003eThe Cambridge Dictionary of Statistics\u003c/em\u003e Fourth edn (Cambridge University Press, 2010).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFisher, R. A. \u003cem\u003eStatistical Methods for Research Workers\u003c/em\u003e (Hafner Publishing Company, 1950).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePepe, M. S. Receiver Operating Characteristic Methodology. \u003cem\u003eJ. Am. Stat. Assoc.\u003c/em\u003e \u003cb\u003e95\u003c/b\u003e (449), 308\u0026ndash;311 (2000).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBamber, D. 
The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. \u003cem\u003eJ. Math. Psychol.\u003c/em\u003e \u003cb\u003e12\u003c/b\u003e (4), 387\u0026ndash;415 (1975).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eR Development Core Team. R: A Language and Environment for Statistical Computing. In., vol. 3.4.4. Vienna, Austria: R foundation for Statistical Computing; (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFisher, R. A. The use of multiple measurements in taxonomic problems. \u003cem\u003eAnnals Eugenics\u003c/em\u003e. \u003cb\u003e7\u003c/b\u003e (2), 179\u0026ndash;188 (1936).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNIST/SEMATECH e-Handbook of Statistical Methods. [\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.itl.nist.gov/div898/handbook]\u003c/span\u003e\u003cspan address=\"http://www.itl.nist.gov/div898/handbook]\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRichards-Kortum, R. \u0026amp; Sevick-Muraca, E. Quantitative optical spectroscopy for tissue diagnosis. \u003cem\u003eAnnu. Rev. Phys. Chem.\u003c/em\u003e \u003cb\u003e47\u003c/b\u003e, 555\u0026ndash;606 (1996).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGonzalez Viejo, C. et al. Development of a robotic pourer constructed with ubiquitous materials, open hardware and sensors to assess beer foam quality using computer vision and pattern recognition algorithms: RoboBEER. \u003cem\u003eFood Res. Int.\u003c/em\u003e \u003cb\u003e89\u003c/b\u003e, 504\u0026ndash;513 (2016).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGonzalez Viejo, C., Fuentes, S., Torrico, D., Howell, K. \u0026amp; Dunshea, F. R. 
Assessment of beer quality based on foamability and chemical composition using computer vision algorithms, near infrared spectroscopy and machine learning algorithms. \u003cem\u003eJ. Sci. Food Agric.\u003c/em\u003e \u003cb\u003e98\u003c/b\u003e (2), 618\u0026ndash;627 (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKalousis, A., Prados, J. \u0026amp; Hilario, M. Stability of feature selection algorithms. In: \u003cem\u003eFifth IEEE International Conference on Data Mining (ICDM'05): 2005\u003c/em\u003e. IEEE: 8 pp.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNogueira, S., Sechidis, K. \u0026amp; Brown, G. On the stability of feature selection algorithms. \u003cem\u003eJ. Mach. Learn. Res.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e (174), 1\u0026ndash;54 (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBommert, A. \u0026amp; Lang, M. stabm: Stability measures for feature selection. \u003cem\u003eJ. Open. Source Softw.\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e (59), 3010 (2021).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"binary classification, 
discriminant, feature selection, nonparametric statistic, Diagnostic Wisconsin Breast Cancer Database","lastPublishedDoi":"10.21203/rs.3.rs-8030267/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8030267/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e\u003cp\u003eThis paper characterizes the fraction-product as a novel discriminant statistic, which we have found to be extremely useful in feature selection on spectroscopic data. In supervised binary classification, the fraction-product measures the amount of taxonomic information in each attribute. The simplicity of the idea facilitates its adaptation to different data sets and, in some settings, leads to new, useful measures. After a discussion of its mathematical foundation, it is applied as a worked example to the Diagnostic Wisconsin Breast Cancer Database.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e\u003cp\u003eThe analysis of non-spectroscopic data suggests the utility of another new measure which is called taxonomic potential. Given two attributes, the taxonomic potential measures the potential for one feature to have taxonomic information that is not explained by its correlation with the other feature. The fraction-product and taxonomic potential allow the rapid selection of four features which, after weighting with linear discriminant analysis, lead to accuracy\u0026thinsp;=\u0026thinsp;97.9%; recall\u0026thinsp;=\u0026thinsp;1.0; precision\u0026thinsp;=\u0026thinsp;92.2%. Moreover, the three major features are stable with respect to variations of the training set.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e\u003cp\u003eThe fraction-product is a new discriminant statistic that has been useful in supervised, binary classification in two very different data sets: spectra and geometric measures of cell nuclei. 
It is simple and can be easily adapted to unique features of the data for the best outcomes.\u003c/p\u003e","manuscriptTitle":"The Fraction-product: A Novel Discriminant Statistic for Binary Classification","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-19 11:19:40","doi":"10.21203/rs.3.rs-8030267/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"44af2029-19b4-4908-b354-a73ba6ac3abc","owner":[],"postedDate":"November 19th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":58040777,"name":"Biological sciences/Cancer"},{"id":58040778,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":58040779,"name":"Physical sciences/Mathematics and computing"}],"tags":[],"updatedAt":"2025-12-01T03:38:47+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-19 11:19:40","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8030267","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8030267","identity":"rs-8030267","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
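
The taxonomic signal described under "Applications specific to spectroscopy" (fp/0.25, multiplied over contiguous pixels) can be illustrated with a minimal sketch. The fraction-product itself is not computed here: the fp-values are hypothetical inputs, the function names are illustrative rather than the authors', and the admissible range [0.25, 1] for fp is only inferred from the statement that the signal ranges between 1 and 4.

```python
from math import prod

# fp-value of the no-information reference state, as stated in the text
NO_INFO_FP = 0.25


def taxonomic_signal(fp):
    """Scale one fp-value by the reference state: fp / 0.25."""
    # Assumed range, inferred from the signal lying between 1 and 4.
    if not NO_INFO_FP <= fp <= 1.0:
        raise ValueError(f"fp-value {fp} is outside [0.25, 1]")
    return fp / NO_INFO_FP


def region_signal(fp_values):
    """Product of per-pixel taxonomic signals over contiguous pixels."""
    return prod(taxonomic_signal(fp) for fp in fp_values)


# Hypothetical fp-values on three contiguous pixels (not data from the paper).
fp_region = [0.5, 0.75, 1.0]

# Grow the region one pixel at a time and compare the two products.
signals = [region_signal(fp_region[:n]) for n in range(1, 4)]
raw_products = [prod(fp_region[:n]) for n in range(1, 4)]

print(signals)       # scaled product: never decreases as pixels are added
print(raw_products)  # unscaled product: never increases as pixels are added
```

Because every scaled factor is at least 1, widening the region can only hold or raise the scaled product, whereas the product of unscaled fp-values can only hold or fall, which is the contrast the text draws.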
