Predicting future grant amounts using topic-level features

Research Article

Gard B Jenset

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8248518/v1
This work is licensed under a CC BY 4.0 License
Status: Posted
Version 1 posted

Abstract

This study addresses research funding allocation, in the form of a large-scale empirical investigation into the structural factors, at the topic-level, that predict funding levels in the form of aggregated grant amounts, using a lagged panel-data approach. The topic-centric focus sets this work apart from previous research on which features predict success for individual grant applications. Understanding the topic-level dynamics around research funding provides a crucial complement to the individual-grant perspective, with a potential for informing research strategy and scientific priorities.
Employing a data-driven approach based on large-scale data across more than 1,100 topics covering over 130 million publications, the study demonstrates that in addition to topic size, signals of socio-economic impact in the form of links to patents and policy documents, as well as citation patterns and aspects of the researcher community active in the topic, are significant predictors of future funding levels. A model triangulation approach, combining conditional inference trees, linear mixed effects regression, and Random Forest, reveals a clear hierarchy of predictor importance across different statistical models. Clear signals of socio-economic impact driving research funding emerge both from this hierarchy of effects and from an outlier analysis. Taken together, the results provide compelling evidence for the structural, topic-level traces of socio-economic impact that influence research grants.

Keywords: Information Retrieval and Management, scientific topics, scientometrics, research funding, science of science

1. Introduction

Research funding is a crucial pre-requisite for advances in innovation and research (Bol et al., 2018; Győrffy et al., 2020; Zeng et al., 2019). Hence, it is hardly surprising that there has been a continued interest in understanding the factors that influence research funding decisions. Such funding decisions have obvious ramifications for individual researchers (Bol et al., 2018), but the implications go beyond individuals. Both structural inequalities (Hoppe et al., 2019) and the needs of governments and funders to measure the impact of grants (Reed et al., 2021) are deeply intertwined with funding allocation. Several studies have examined the factors that influence funding decisions for individual grant applications (Bol et al., 2018; Hoppe et al., 2019; Győrffy et al., 2020). However, there is a crucial gap when it comes to tackling research funding allocation at the structural, or topic, level.

Topics are widely acknowledged to play a significant role in research funding allocation at the level of individual grants (Győrffy et al., 2020; Hoppe et al., 2019). More generally, topics, broadly understood, serve as the organising focus of research communities (Held, 2022; Zeng et al., 2019). To use an analogy: if research funding is the fuel of research, then topics are the pipes it flows through. The present paper demonstrates in concrete terms how the research funding landscape can be understood in structural terms by investigating topic-level features and how they affect funding levels. In addition to the downstream effects of structural influences on individual researcher careers, these structural funding patterns have implications for measuring the socio-economic impact of research, broadly understood as benefits to people, society (including the economy), and the environment, beyond academia (Ravenscroft et al., 2017; Reed et al., 2021).

2. Background

The critical role played by research funding allocation has prompted studies that look at research funding decisions from the point of view of structural inequalities, as well as the evaluation process leading up to funding decisions. Hoppe et al. (2019) show how, in biomedical research, African American/black scientists receive a lower rate of funding from National Institutes of Health grants, relative to white scientists. After breaking the application process data into six stages, Hoppe et al. (2019) identify topic choice as accounting for over 20% of the funding gap. Bol et al. (2018) identify a cumulative “Matthew-effect” in research funding, whereby researchers whose projects score just above the funding threshold go on to win disproportionately more grants than researchers with projects of similar quality just below the threshold. Bol et al. (2018) link this Matthew-effect to funding inequalities among researchers that have lasting impacts on academics’ career trajectories.
While they observe the effect in all fields, they argue that for topics where material and infrastructure costs are low, distributing smaller grant amounts over a larger number of researchers would reduce some of the arbitrary effects they document, and ultimately promote increased meritocracy.

The grant review process has also come under scrutiny. Győrffy et al. (2020) point out that research grants are awarded near universally based on manual peer review. As a result, multidisciplinary topics have a lower grant success rate, probably as a result of a lack of qualified peer reviewers. Despite this, Győrffy et al. (2020) find only a weak correlation between reviewer scores and subsequent publication outputs from the funded projects. Conversely, the principal investigator’s scientometric profile, especially their H-index, shows a strong correlation with the subsequent publication output in highly ranked journals. Sikimić & Radovanović (2022) use machine learning to predict project efficiency in high energy physics grant applications. Despite achieving moderately high predictive accuracy, they argue for caution in practical use of such models, due to the observed variation in citation patterns across subfields and topics.

The findings above must be viewed in light of a fluid and changing research landscape. Zeng et al. (2019) find that there is an increasing tendency for researchers to move between research topics, while McGillivray et al. (2022) show how fields change over time, with patterns of increasing research specialisation, alongside multidisciplinary trends leading to fields growing closer.

What these studies have in common is that they illustrate the importance of topics and research fields as significant factors for both funding allocation decisions and the decisions by researchers on what research to pursue, which again shapes the overall landscape.
Yet a topic-centric overview of the role that topic choice plays when it comes to research funding allocation is missing. The present study is a step towards filling this gap, via a large-scale, data-driven investigation of topic-level features and their relation to research funding.

3. Data and methods

The main data source for the present research is Dimensions, a large bibliographic database with rich metadata (Hook et al., 2018). The Dimensions database was favoured over OpenAlex, due to the former’s strength in terms of metadata (Alperin et al., 2024). Dimensions contains document-level information with metadata indicating, among other things, year of publication and citation information. Furthermore, Dimensions also covers some of the scientific “grey literature” in the form of patents, policy documents, and clinical trials, which are linked to the research publications referenced in them. For a full overview of Dimensions data sources and the information contained in them, see the online documentation (Dimensions, n.d.).

In addition to Dimensions, we made use of a new, large-scale, document-level topic annotation knowledge organisation system (KOS), to group publications into topics. This KOS can be linked to all publications in Dimensions, with each publication receiving a topic classification at four different levels of granularity. The topics are organised in a mono-hierarchy, so that each publication belongs to one topic at each level, and the number of topics increases by roughly one order of magnitude for each level:

Level 1: 22
Level 2: 177
Level 3: >1,100
Level 4: >29,000

The most granular (level 4) topics were created by clustering a direct citation network using Dimensions data. These topics were then mapped, using machine learning, to the most granular fields in the Australian and New Zealand Standard Research Classification (ANZSRC) Fields of Research ontology (Porter et al., 2023).
Once that link had been made, the more aggregated higher-level fields could be mapped on top directly from the ANZSRC fields of research ontology. The topic KOS, including the methods used to construct it, is described in depth in Jenset et al. (2025).

For the present study, the Dimensions data was linked with the topic KOS using SQL, with both sets of data being stored in Google BigQuery tables. The topic level used was level 3, corresponding to the ANZSRC “Field”-level. This level of granularity is convenient to work with for an initial exploration, being specific yet broad enough to provide a good overview. Some example topics at this level are:

Accessible Computing
Acoustics and Noise Control
Acute Care
Adolescent Health
Aerodynamics
Aerospace Materials

In total, funding data could be associated with 1,106 such topics for the time-period covered.

To test the predictive capability of the topics, the data were grouped into time segments, using a lagged panel data design. The topic-level predictor variables were calculated for the years 2015 to 2020 (inclusive). The funding data was calculated at the topic level for the time-period 2021 to 2023 (inclusive), effectively giving us one six-year period, and one three-year period. This time segmentation was based on the average grant duration, which according to Dimensions data is 2.33 years, rounded up to 3 years to allow some margin of error.

The funding information for 2021 to 2023 was calculated as follows. The grants covering the relevant time-period were identified in Dimensions. The grant amounts, in Euros, were divided equally over the publications linked to the grant. The grant amounts from different locations had been converted into Euros in the Dimensions metadata. Based on the topics assigned to the publications, the grant amount article-shares were summed per topic.
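
The fractional allocation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study (which linked Dimensions and the topic KOS via SQL in BigQuery): the record layout, the function name topic_funding, and the toy grant and publication identifiers are all hypothetical.

```python
from collections import defaultdict


def topic_funding(grants, pub_topic):
    """Split each grant equally over its linked publications and
    sum the resulting article-shares per topic.

    grants    -- list of (amount_eur, [publication ids]) pairs (hypothetical layout)
    pub_topic -- dict mapping publication id -> level 3 topic label
    """
    totals = defaultdict(float)
    for amount_eur, pub_ids in grants:
        if not pub_ids:
            continue  # a grant with no linked publications contributes nothing
        share = amount_eur / len(pub_ids)  # equal share per linked publication
        for pub_id in pub_ids:
            topic = pub_topic.get(pub_id)
            if topic is not None:  # skip publications without a topic label
                totals[topic] += share
    return dict(totals)


# Toy data, invented purely for illustration.
grants = [
    (300_000.0, ["p1", "p2", "p3"]),
    (100_000.0, ["p3", "p4"]),
]
pub_topic = {
    "p1": "Acoustics and Noise Control",
    "p2": "Aerodynamics",
    "p3": "Aerodynamics",
    "p4": "Acute Care",
}
funding = topic_funding(grants, pub_topic)
```

In this toy example the per-topic totals add back up to the combined grant amount, which holds whenever every linked publication carries a topic label. A publication linked to several grants simply accumulates one share from each.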

For the period from 2015 to 2020, seven quantitative bibliographic features were calculated for each topic, with the aim of covering different aspects of the topic. See Table 1 for an overview.

Table 1. Quantitative features calculated for each topic. All features were calculated over the period from 2015 to 2020 (inclusive), based on Dimensions data.

| Field | Type | Explanation |
| --- | --- | --- |
| Size | Integer | The number of published research papers in the topic. |
| Patents | Integer | The number of patents citing the documents in the topic. |
| Policy | Integer | The number of policy documents citing the documents in the topic. |
| TIF | Float | A citation score using the journal impact factor formula. |
| Author type-token ratio | Float | The ratio of unique authors over all authors within the topic, in analogy with the linguistic type-token ratio for vocabulary richness. |
| Topic family size | Integer | The number of topics that belong to the same top-level topic in the hierarchy, i.e. the number of topic “siblings”. |
| Clinical trials | Integer | The number of clinical trials linked to the research papers in a topic. |

The variables can be motivated as follows: the topic size is both a natural control factor as well as an important variable motivated by previous research (Hoppe et al., 2019). The TIF variable accounts for academic citations which, despite the problems associated with academic citations, are still an important measure of academic impact (Zhu et al., 2015). Patent-to-patent citations are linked to technological innovation, and although the status of patent-to-research article citations is less clear-cut they still represent a plausible link between academic research and commercial activity (Velayos-Ortega & López-Carreño, 2023). Similarly, citations from policy documents to research papers represent a possible pathway for research to influence policy decisions, but it should be kept in mind that such citations do not always entail a direct link with policy (Newson et al., 2018; Yu et al., 2023).
In a more specific context, clinical trials have been considered as a proxy for non-academic impact in medicine (Ovseiko et al., 2012). The author type-token ratio (the number of unique authors divided by all authorship instances) is an indicator modelled on the type-token ratio in linguistic contexts, where the number of unique words divided by all words is often used as an approximation of vocabulary richness (Baayen, 2001, Chap. 1). In the present context, the author type-token ratio can be interpreted as representing the degree to which the research on a topic is spread out over a large community with many one-off contributors (values closer to 1) or, conversely, is supported by a stable community with many repeat contributors (values closer to 0). Finally, the topic family size captures how crowded the space around a given topic is, by counting its siblings at the same level, i.e. topics with the same parent-topic, on the hypothesis that these might compete for funding.

In addition to the variables above, the top-level (level 1) topic for each topic was included in the data for use in the analysis (see details below). The following sections describe the statistical analyses, which were performed in R. See Jenset (2025) for the full data used.

4. Results

Section 4.1 below gives an exploratory overview of the data. Following Tagliamonte & Baayen (2012), a model triangulation approach was used to pinpoint feature importance using three different models that capture different aspects of the relationships in the data. The following models were used: a conditional inference tree, mixed-effects regression, and Random Forest. Sections 4.2 – 4.4 present the findings for the separate models, while section 4.5 summarises the model results.
Section 4.6 explores the extent to which the observed topic funding matches the expected funding, in an exploration of whether some topics are over- or under-funded, relative to their parameter values.

4.1. Exploratory analysis

The summed grant amounts per topic proved to be highly unbalanced, as the overview in Table 2 shows (numbers are in thousands of Euros).

Table 2. Summary of the grant amount data per topic, for the period 2021–2023, aggregated from the fractional grant shares per publication. Numbers are in thousands of Euros.

| Min | 1st quartile | Median | Mean | 3rd quartile | Max |
| --- | --- | --- | --- | --- | --- |
| 1 | 4,548 | 39,451 | 1,214,180 | 297,327 | 335,431,153 |

Similarly, the topic size is also highly unbalanced, as Table 3 shows.

Table 3. Topic-size distribution in number of publications, for the period 2015–2020. The total number of publications covered adds up to over 131 million.

| Min | 1st quartile | Median | Mean | 3rd quartile | Max |
| --- | --- | --- | --- | --- | --- |
| 241 | 7,696 | 32,513 | 119,169 | 115,949 | 2,910,096 |

The number of patent citations, number of policy references, and the topic family size variables were also highly skewed. For this reason, all the skewed variables were log-transformed by taking the natural logarithm. For interpretability, the topic impact factor (TIF) and author type-token ratios were left at their original scales.

The scatter plot matrix in Fig. 1 shows the pairwise correlation of the final variables in the upper half. The lower half gives the Pearson and Spearman correlation coefficients, with the corresponding p-values. As the plots in Fig. 1 show, all variables are correlated with the future funding amounts. There are varying levels of Pearson correlation between the other variables, but all are below 0.5, suggesting a low to moderate level of correlation.

4.2. Conditional inference tree model

To further explore how the topic-level features can explain the variation in the grant levels, a conditional inference tree was used to visualise the salient relationships, fitted with the party package in R (Hothorn et al., 2006). Conditional inference trees are a type of recursive partitioning model, which can be thought of as a form of non-parametric regression. They are easy to visualise and interpret, which makes them complementary to traditional regression models, especially for exploration. The model form was:

Response: log-transformed, aggregated grant amounts by topic
Predictors: log-size, log-patent citations, log-policy citations, log-clinical trials, author TTR, TIF, topic family size

The plot in Fig. 2 can be read top to bottom, as a flow-chart. For readability, only the four highest levels in the tree are shown. The most important variable (in terms of explaining the response) is at the top, in this case the log-transformed topic size variable. The nesting in the tree indicates interactions between variables. Variables that are not present have been omitted as unnecessary by the model. The leaf nodes show the number of topics corresponding to that combination of variable values, as well as the mean value of the response variable.

Fig. 2 shows that log-size is the most important variable for predicting future grant levels. For topics larger than about 32,000 publications (10.378 on a logarithmic scale), we see that patents (and size again) are important predictors. For topics smaller than about 32,000 publications, the author type-token ratio variable is important. The TIF variable only shows up at the lowest level in the tree, indicating that it explains less of the response. A Pearson correlation test shows that the predicted responses from the tree model are highly correlated with the response variable (r = .92, t = 79.102, df = 1104, p-value < .001).

4.3. Linear mixed-effects regression model

To complement the conditional inference tree model, a regression model was used to capture the linear relationship between the funding levels and the predictors. The topics are not fully independent, since they sit nested within broader topics. To account for this, a linear mixed-effects model was fitted, using the lme4 package in R (Bates et al., 2015), adding a random intercept for the level 1 topic that the lower-level topic belongs to. The form of the linear model was as follows:

Response: log-transformed grant levels by topic
Fixed effects (predictors): log-size, log-patent citations, log-policy citations, log-clinical trials, author TTR, TIF, topic family size
Random intercept: level 1 topic

Since the author type-token ratio is a number between 0 and 1, the typical one-unit change interpretation of the regression coefficient is not meaningful. Hence, the variable was rescaled so that its coefficient reflects a 0.01-unit change. The model showed no sign of structural problems judging from a visual inspection of the residuals. Marginal and conditional pseudo-R² values indicate that the fixed effects explain about 83%, and the full model (including the random intercept) about 85%, of the variation in the response. Table 4 gives the predictor coefficients, along with the standard error, degrees of freedom, t-value, and p-value.

Table 4. Fixed effect regression coefficients. A positive β coefficient indicates a corresponding proportional increase in funding.

| Predictor | Beta coef | Std. Error | df | t value | p value |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | -2.04 | 0.64 | 671.28 | -3.20 | < .001 |
| log_size | 1.16 | 0.04 | 1044.90 | 30.28 | < .001 |
| log_patents | 0.03 | 0.02 | 1070.62 | 1.79 | .07 |
| log_policy | 0.11 | 0.02 | 981.59 | 6.44 | < .001 |
| log_trials | -0.02 | 0.01 | 916.41 | -1.43 | .15 |
| tif | 0.19 | 0.02 | 1092.11 | 9.88 | < .001 |
| n_l3_siblings | -0.01 | 0.01 | 1077.08 | -1.40 | .16 |
| author_ttr | 0.08 | 0.01 | 629.70 | 14.62 | < .001 |

Since the response variable is log-transformed, the coefficients can be interpreted in terms of percentage changes in funding.
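
As a worked illustration of that interpretation, the short Python sketch below converts some of the Table 4 coefficients into percentage changes. It assumes the standard reading of a regression with a natural-log response: for a log-transformed predictor the coefficient acts as an elasticity, while for a predictor kept on its original scale a one-unit increase multiplies the response by exp(β). The helper names are hypothetical, and the calculation is an illustration rather than part of the original analysis.

```python
import math

# Fixed-effect coefficients copied from Table 4.
BETA = {"log_size": 1.16, "log_policy": 0.11, "tif": 0.19, "author_ttr": 0.08}


def pct_change_log_predictor(beta, pct_increase=1.0):
    """Percentage change in funding when the count behind a log-transformed
    predictor grows by `pct_increase` percent (log-log case)."""
    return ((1.0 + pct_increase / 100.0) ** beta - 1.0) * 100.0


def pct_change_linear_predictor(beta, units=1.0):
    """Percentage change in funding for a `units` increase in a predictor
    kept on its original scale (log-linear case)."""
    return (math.exp(beta * units) - 1.0) * 100.0


# 1% more publications in the topic.
size_effect = pct_change_log_predictor(BETA["log_size"])
# 10% more policy citations.
policy_effect = pct_change_log_predictor(BETA["log_policy"], 10.0)
# TIF higher by one unit.
tif_effect = pct_change_linear_predictor(BETA["tif"])
# Author TTR higher by 0.01, i.e. one unit of the rescaled variable.
ttr_effect = pct_change_linear_predictor(BETA["author_ttr"])

for name, value in [("size", size_effect), ("policy", policy_effect),
                    ("tif", tif_effect), ("author_ttr", ttr_effect)]:
    print(f"{name}: {value:.2f}%")
```

Under these assumptions, a topic with 1% more publications has roughly 1.2% more predicted funding, a one-unit higher TIF corresponds to roughly 21% more, and a 0.01 higher author type-token ratio to roughly 8% more, with the other predictors held constant.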
Table 4 shows that size is the variable with the largest positive effect, followed by TIF, the number of policy citations, and author type-token ratio. These effects are all highly significant; the remaining predictors are not significant at the conventional thresholds.

As with the tree model, a Pearson correlation test shows that the predicted responses from the mixed effects model are highly correlated with the response variable (r = .96, t = 111.47, df = 1104, p-value < .001), but here the correlation is even higher, suggesting even better model performance. A scrutiny of the predicted values of the linear model shows a similar log-normal distribution to the original response variable.

4.4. Random Forest model

Finally, a Random Forest model was used to assess the overall, or global, predictor importance (Saarela & Jauhiainen, 2021). Random Forest models often perform well for prediction tasks and although they can be hard to interpret directly, they produce robust estimates of predictor importance. The model form was the same as for the conditional inference tree:

Response: log-transformed, aggregated grant amounts by topic
Predictors: log-size, log-patent citations, log-policy citations, log-clinical trials, author TTR, TIF, topic family size

A Pearson correlation test shows that the predicted responses from the Random Forest model are highly correlated with the response variable (r = .91, t = 74.416, df = 1104, p-value < .001). The predictor importance is discussed in the next section, where it is compared to the results from the other two models.

4.5. Predictor importance

To assess predictor importance, the results from each of the three models were converted into ranks, where 1 is the most important according to model-specific criteria. For the conditional inference tree, ranks were assigned based on the level in the tree where the variable first appeared.
Since log-size is the root node in the tree, it was assigned rank 1, whereas log-patent citations first appear nested under the root node, giving it rank 2. Only variables appearing in the first five levels of the tree were ranked; the others were coded as 0. For the linear mixed-effects regression model, ranks were assigned based on the size of the variable coefficient, with non-significant variables assigned a rank of 0. Finally, for the Random Forest model, the predictors were ranked by their contribution to the increase in node purity. To obtain a robust overall ranking, the median rank across the three models was calculated, as shown in Table 5 below.

Table 5. Predictor importance ranks from the Random Forest (RF) model, the conditional inference tree (CIT) model, and the linear mixed-effects regression (LMER) model, summarised in the form of a median rank across models. The most important variable is coded as 1, non-significant results are coded as 0.

| Predictor | RF rank | CIT rank | LMER rank | Median rank |
| --- | --- | --- | --- | --- |
| log_size | 1 | 1 | 1 | 1 |
| log_patents | 2 | 2 | 0 | 2 |
| log_policy | 3 | 5 | 3 | 3 |
| tif | 4 | 4 | 2 | 4 |
| author_ttr | 5 | 3 | 4 | 4 |
| log_trials | 6 | 0 | 0 | 0 |
| n_l3_siblings | 7 | 0 | 0 | 0 |

The median rank across the models shows that log-size is consistently the most important predictor across the models, and that the socio-economic impact indicators (log-patent citations, log-policy citations) are consistently ranked highly, with the next tier made up of the academic indicators (citations in the form of TIF and author TTR), sharing the fourth place. Topic family size and log-clinical trials come in last and seem less important compared to the other predictors.

4.6. Differences between predicted and observed funding levels

With this information, we can answer the following question: are some topics over- or underfunded, relative to what we would expect given the information we have about them? To test this, the linear mixed-effects regression model was used to predict funding amounts for the topics.
This model was chosen because it had the best overall performance in terms of correlation between predicted and observed values. Once the predictions had been calculated, the absolute difference between the predicted and the actual (or observed) funding amounts was calculated. Since the distribution of differences was symmetric and centred on zero, the following approach was used to identify significant differences between observed and predicted values: the standard deviation for the differences was calculated and compared to the individual differences. Any difference between observed and predicted values that was greater than two standard deviations from the mean was taken to be a notable difference, as an approximation to statistical significance at the .05-level. This identified 26 topics whose funding was greater than expected (Appendix A), and 28 topics whose funding was lower than expected (Appendix B).

That result shows that the overwhelming majority of topics are funded in accordance with the parameter values of the model. For the small minority of topics that receive more funding than predicted by the model, we can see traces of the cost of research infrastructure (Particle Physics, Nuclear Physics), as well as traces of policy priorities (Mathematics and Numeracy Curriculum and Pedagogy, Geriatrics and Gerontology). For the topics that receive less funding than predicted, we find a mix of topics that range from law and the humanities (Art Criticism, Taxation Law) to STEM topics like Avionics, Electrometallurgy, and Food Engineering.

A natural question is whether this is simply a case of STEM topics receiving more funding than Humanities and Social Sciences topics. However, such a conclusion is not supported by the data. If the list of over- or underfunded topics reflected a systematic bias in favour of STEM topics, we would expect a statistical difference between the two lists in terms of which level 1 topics were over- or underfunded.
To test this, an aggregated table was created with the counts of how many times a level 1 topic was either over- or underfunded (Appendix C). Since the expected counts in this table were all below 5, Fisher’s Exact Test for count data was used to test the association between level 1 topic and over- or underfunding. A two-sided test showed no statistically significant relationship between over- and underfunding vs. level 1 topics (p = .869). This finding is expected, since the mixed-effects regression design explicitly accounts for variations in baseline funding for level 1 topics via the random intercept.

A possible interpretation of these results is that multiple processes contribute to the observed over- and underfunding patterns. While some topics might receive higher than expected funding rates due to policy considerations or high infrastructure costs, a complementary perspective is that the underfunded topics are highly efficient and “overperforming”, given that their parameters suggest a higher funding level than what is observed.

5. Discussion and conclusion

This study has explored how features associated with research topics can predict future grant amounts per topic, using a conditional inference tree, a Random Forest model, and a linear mixed-effects model, within a lagged panel-design study. All models fit the data well, with the mixed-effects model showing a slightly better performance than the other two.

All three models agree that past topic size is of paramount importance for predicting future levels of funding, similarly to what has been observed for a specific field like medicine (Vanderelst & Speybroeck, 2013). This is not unreasonable, since established, well-funded research topics will be able to produce more research which again forms the basis of future grant applications and hence be able to attract more funding from funding agencies, leading to a form of Matthew effect (Bol et al., 2018).

Using a model-triangulation approach, it was found that across the three models, citations from patents and policy documents came second and third in importance, respectively. This could plausibly be interpreted to mean that considerations of socio-economic impact are influencing which topics receive funding. That interpretation is congruent with the findings in Vanderelst & Speybroeck (2013), who find that medical research is largely funded in accordance with disease burden, albeit skewed towards the disease burden in rich countries.

At a joint fourth place, across the three models, came features related to academic impact and the research community, viz. the TIF variable and repeat authorship represented by the TTR variable. These results suggest that, all else being equal, highly cited topics are more successful at deriving funding. Similarly, the author type-token ratio result suggests that topics where we find more unique authors, relative to all authors who published in the topic, are rewarded with more funding. This could potentially be an effect of larger, better funded topics having a larger contingent research community, composed of early career researchers who publish once in the topic. Alternatively, a high author type-token ratio could point to a multi-disciplinary topic that occasionally draws on the expertise of researchers mostly publishing in other topics. More research is needed to clarify this, and the two are not necessarily mutually exclusive. Conversely, topics with a lower author type-token ratio would either way point to a more established research community publishing repeatedly on the topic, with fewer new entrants.

Finally, the number of clinical trials and the immediately surrounding topic landscape (in the form of the number of topic siblings in the KOS) came last with no or only negligible effects across the three models.
The trials result could be due to patents and policy documents presenting a stronger signal that washes out any effect of a simple count of clinical trials. A more targeted classification of trial information might throw more light on this. The lack of effect for the number of sibling-topics could imply that grants are sufficiently targeted and specific that related topics are not necessarily competing for funding, but further research would be needed to explore this relationship.

It is important to underline that the effects above cannot simply be ascribed to differences between research fields. The linear mixed-effect model used a random intercept term for the top-level topics, such as Biological Sciences, Chemical Sciences, Economics, and History, Heritage and Archaeology. The results discussed here, in the form of the fixed effects (predictor variables), represent the influence these variables have, over and above what can be explained by the random effect, i.e. the top-level topics.

In conclusion, the results discussed in the present paper suggest that larger topics are rewarded by more research funding, and that signals of socio-economic impact (policy and patent citations to research) are important secondary lagged correlates of funding. The internal academic indicators (academic citations, author type-token ratio) play a comparatively lesser, but still significant, role. This points to a hierarchy of considerations driving funding, some of which can be modelled accurately.

Overall, the modelling approach employed in the present paper suggests that most topics are funded largely in accordance with what their size and other parameters would predict. For the topics whose funding is above or below expected levels, we can see traces of both the baseline infrastructure costs of doing research, as well as policy considerations around health and education.
However, more research is needed to clarify the underlying dynamics of the topics marked as over- or underfunded in the present analysis. Further research is also required to determine how these effects play out over different time spans, at different levels of topic granularity, and across funders, funder types, and geographical regions. Overall, the results in the present paper clearly suggest that investigating the relation between topics and research funding is a valuable complement to research on the effect of topics on individual funding applications. The topic-centric view presented here provides insights into the overarching structural patterns in the research funding landscape by identifying potential system-level drivers of research funding, which point back both to structural effects arising from policy and strategy and to the dynamics of the research topics themselves. As such, the findings are well-positioned to inform decision-making on research policy and scientific funding. A better understanding of the structural factors at play could provide funders and policymakers with additional levers for adjusting funding strategies, as well as an evidence base for addressing topic-level funding inequalities such as those identified by Bol et al. (2018) and Hoppe et al. (2019). The “Matthew-effect” identified by Bol et al. (2018) for individual researchers is equally relevant at a macro-structural topic level, where it has potentially far-reaching ramifications in terms of funding and its consequences for the socio-economic impact of research.

Declarations

Open science practices

The paper relies primarily on Dimensions data, which is free for research purposes. The dataset used for the statistical analysis is freely available for research purposes from Figshare, see Jenset (2025), https://doi.org/10.6084/m9.figshare.30740195.
Competing interests

The study uses data from the Dimensions database. Dimensions is a product from Digital Science whose owner, the Holtzbrinck Publishing Group, is also one of the owners of Springer Nature, where the author is employed.

Acknowledgements

This paper is a revised version of a manuscript that was peer-reviewed and accepted for presentation at the 29th Annual STI-ENID Conference, held at the University of Bristol, UK, from 3-5 September 2025. Thanks are due to two anonymous STI-ENID reviewers who suggested improvements to the manuscript, as well as to the participants of the STI-ENID session for their questions and comments. This work was carried out within a broader project on measuring the socio-economic impact of research. The author is grateful to the project group members for feedback and discussion of early versions of this work: James Bayliss, Jack England, Daren Howell, Alison Mercer, Vera Nienaber, Roland Payton, Inês Pote, and Alex Rubleva.

References

Alperin JP, Portenoy J, Demes K, Larivière V, Haustein S (2024) An analysis of the suitability of OpenAlex for bibliometric analyses. arXiv preprint arXiv:2404.17663
Baayen RH (2001) Word frequency distributions. Springer
Baayen RH, Shafaei-Bajestan E (2019) languageR: Analyzing linguistic data: A practical introduction to statistics. R package version 1.5.0. https://CRAN.R-project.org/package=languageR
Bates D, Maechler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using lme4. J Stat Softw 67(1):1–48. 10.18637/jss.v067.i01
Bol T, De Vaan M, van de Rijt A (2018) The Matthew effect in science funding. Proceedings of the National Academy of Sciences 115(19):4887–4890
Dimensions (n.d.) Data sources. Retrieved April 9, 2025, from https://docs.dimensions.ai/dsl/data-sources.html
Győrffy B, Herman P, Szabó I (2020) Research funding: past performance is a stronger predictor of future scientific output than reviewer scores. J Informetrics 14(3):101050
Held M (2022) Know thy tools! Limits of popular algorithms used for topic reconstruction. Quant Sci Stud 3(4):1054–1078
Hook DW, Porter SJ, Herzog C (2018) Dimensions: building context for search and evaluation. Front Res Metrics Analytics 3:23
Hoppe TA, Litovitz A, Willis KA, Meseroll RA, Perkins MJ, Hutchins BI, Davis AF, Lauer MS, Valantine HA, Anderson JM, Santangelo GM (2019) Topic choice contributes to the lower rate of NIH awards to African-American/black scientists. Sci Adv 5(10):eaaw7238
Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: A conditional inference framework. J Comput Graphical Stat 15(3):651–674
Jenset GB (2025) Data from Predicting future grant amounts using topic-level features. 10.6084/m9.figshare.30740195.v1
Jenset GB, Bevan P, Jain A (2025) A large-scale, granular topic classification system for scientific documents. Research Square preprint. 10.21203/rs.3.rs-6529718/v1
McGillivray B, Jenset GB, Salama K, Schut D (2022) Investigating patterns of change, stability, and interaction among scientific disciplines using embeddings. Humanit Social Sci Commun 9(1):1–15
Newson R, Rychetnik L, King L, Milat A, Bauman A (2018) Does citation matter? Research citation in policy documents as an indicator of research impact–an Australian obesity policy case-study. Health Res Policy Syst 16(1):55
Ovseiko PV, Oancea A, Buchan AM (2012) Assessing research impact in academic clinical medicine: a study using Research Excellence Framework pilot impact indicators. BMC Health Serv Res 12(1):478
Porter SJ, Hawizy L, Hook DW (2023) Recategorising research: Mapping from FoR 2008 to FoR 2020 in Dimensions. Quant Sci Stud 4(1):127–143
Ravenscroft J, Liakata M, Clare A, Duma D (2017) Measuring scientific impact beyond academia: An assessment of existing impact metrics and proposed improvements. PLoS ONE 12(3):e0173152
Reed MS, Ferré M, Martin-Ortega J, Blanche R, Lawford-Rolfe R, Dallimer M, Holden J (2021) Evaluating impact from research: A methodological framework. Res Policy 50(4):104147
Saarela M, Jauhiainen S (2021) Comparison of feature importance measures as explanations for classification models. SN Appl Sci 3(2):272
Sikimić V, Radovanović S (2022) Machine learning in scientific grant review: algorithmically predicting project efficiency in high energy physics. Eur J Philos Sci 12(3):50
Tagliamonte SA, Baayen RH (2012) Models, forests, and trees of York English: Was/were variation as a case study for statistical practice. Lang Variation Change 24(2):135–178
Vanderelst D, Speybroeck N (2013) Scientometrics reveals funding priorities in medical research policy. J Informetrics 7(1):240–247
Velayos-Ortega G, López-Carreño R (2023) Indicators for measuring the impact of scientific citations in patents. World Patent Inf 72:102171
Yu H, Murat B, Li J, Li L (2023) How can policy document mentions to scholarly papers be interpreted? An analysis of the underlying mentioning process. Scientometrics 128(11):6247–6266
Zeng A, Shen Z, Zhou J, Fan Y, Di Z, Wang Y, Stanley HE, Havlin S (2019) Increasing trend of scientists to switch between topics. Nat Commun 10(1):3439
Zhu X, Turney P, Lemire D, Vellino A (2015) Measuring academic influence: Not all citations are equal. J Association Inform Sci Technol 66(2):408–427

Supplementary Files

Appendix.docx
the topic size is also highly unbalanced, as Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eTopic-size distribution in number of publications, for the period 2015\u0026ndash;2020. The total number of publications covered adds up to over 131\u0026nbsp;million.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMin\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1st quartile\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMedian\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eMean\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003e3rd quartile\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eMax\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e241\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003e7,696\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e32,513\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e119,169\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e115,949\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e2,910,096\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe number of patent citations, number of policy references, and the topic family size variables were also highly skewed. For this reason, all the skewed variables were log-transformed by taking the natural logarithm. For interpretability, the topic impact factor (TIF) and author type-token ratios were left at their original scales. The scatter plot matrix in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows the pairwise correlation of the final variables in the upper half. The lower half gives the Pearson and Spearman correlation coefficients, with the corresponding \u003cem\u003ep\u003c/em\u003e-values.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAs the plots in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e show, all variables are correlated with the future funding amounts. There are varying levels of Pearson correlation between the other variables, but all are below 0.5, suggesting a low to moderate level of correlation.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e4.2. 
Conditional inference tree model\u003c/h2\u003e\u003cp\u003eTo further explore how the topic-level features can explain the variation in the grant levels, a conditional inference tree was used to visualise the salient relationships, fitted with the \u003cem\u003eparty\u003c/em\u003e package in R (Hothorn et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2006\u003c/span\u003e). Conditional inference trees are a type of recursive partitioning model, which can be thought of as a form of non-parametric regression. They are easy to visualise and interpret, which makes them complementary to traditional regression models, especially for exploration.\u003c/p\u003e\u003cp\u003eThe model form was:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eResponse: log-transformed, aggregated grant amounts by topic\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ePredictors: log-size, log-patent citations, log-policy citations, log-clinical trials, author TTR, TIF, topic family size\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe plot in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e can be read top to bottom, as a flow-chart. For readability, only the four highest levels in the tree are shown. The most important variable (in terms of explaining the response) is at the top, in this case the log-transformed topic size variable. The nesting in the tree indicates interactions between variables. Variables that do not appear in the tree have been omitted as unnecessary by the model. 
The leaf nodes show the number of topics corresponding to that combination of variable values, as well as the mean value of the response variable.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe plot in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows that log-size is the most important variable for predicting future grant levels (first four levels of tree shown). For topics larger than about 32,000 publications (10.378 on a logarithmic scale), we see that patents (and size again) are important predictors. For topics smaller than about 32,000 publications, the author type-token ratio variable is important. The TIF variable only shows up at the lowest level in the tree, indicating that it explains less of the response.\u003c/p\u003e\u003cp\u003eA Pearson correlation test shows that the predicted responses from the tree model are highly correlated with the response variable (\u003cem\u003er\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.92, \u003cem\u003et\u003c/em\u003e\u0026thinsp;=\u0026thinsp;79.102, df\u0026thinsp;=\u0026thinsp;1104, \u003cem\u003ep\u003c/em\u003e-value\u0026thinsp;\u0026lt;\u0026thinsp;.001).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e4.3. Linear mixed-effects regression model\u003c/h2\u003e\u003cp\u003eTo complement the conditional inference tree model, a regression model was used to capture the linear relationship between the funding levels and the predictors. The topics are not fully independent, since they sit nested within broader topics. To account for this, a linear mixed-effects model was fitted, using the \u003cem\u003elme4\u003c/em\u003e package in R (Bates et al., \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2015\u003c/span\u003e), adding a random intercept for the level 1 topic that the lower-level topic belongs to. 
The form of the linear model was as follows:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eResponse: log-transformed grant levels by topic\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFixed effects (predictors): log-size, log-patent citations, log-policy citations, log-clinical trials, author TTR, TIF, topic family size\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eRandom intercept: level 1 topic\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eSince the author type-token ratio is a number between 0 and 1, the typical one-unit change interpretation of the regression coefficient doesn\u0026rsquo;t make sense. Hence, the variable was rescaled so that it reflects a 0.01-unit change.\u003c/p\u003e\u003cp\u003eThe model showed no sign of structural problems judging from a visual inspection of the residuals. Marginal and conditional pseudo-\u003cem\u003eR\u003c/em\u003e\u003csup\u003e\u003cem\u003e2\u003c/em\u003e\u003c/sup\u003e values indicate that the model predictors explain about 83%, and the full model 85%, of the variation in the response. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e gives the predictor coefficients, along with the standard error, degrees of freedom, \u003cem\u003et\u003c/em\u003e-value, and \u003cem\u003ep\u003c/em\u003e-value.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eFixed effect regression coefficients. 
A positive β coefficient indicates a corresponding proportional increase in funding.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePredictor\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBeta coef\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eStd. 
Error\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003edf\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003e\u003cem\u003et\u003c/em\u003e value\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cem\u003ep\u003c/em\u003e value\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e(Intercept)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e-2.04\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e671.28\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e-3.20\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003elog_size\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1.16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.04\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1044.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e30.28\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003elog_patents\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.03\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1070.62\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e.07\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003elog_policy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.11\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e981.59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e6.44\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003elog_trials\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e-0.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e916.41\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e-1.43\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e.15\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003etif\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.19\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e\u003cp\u003e1092.11\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e9.88\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003en_l3_siblings\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e-0.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1077.08\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e-1.40\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e.16\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eauthor_ttr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.08\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e629.70\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e14.62\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eSince the response variable is log-transformed, the coefficients can be interpreted as percentage changes.\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows that size is the variable with the largest positive effect, followed by TIF, the number of policy 
citations, and author type-token ratio. These effects are all highly significant; the remaining predictors are not significant at the conventional thresholds.\u003c/p\u003e\u003cp\u003eAs with the tree model, a Pearson correlation test shows that the predicted responses from the mixed-effects model are highly correlated with the response variable (\u003cem\u003er\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.96, \u003cem\u003et\u003c/em\u003e\u0026thinsp;=\u0026thinsp;111.47, df\u0026thinsp;=\u0026thinsp;1104, \u003cem\u003ep\u003c/em\u003e-value\u0026thinsp;\u0026lt;\u0026thinsp;.001). Here the correlation is higher than for the tree model, suggesting a better fit. An inspection of the predicted values of the linear model shows a log-normal distribution similar to that of the original response variable.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e4.4. Random Forest model\u003c/h2\u003e\u003cp\u003eFinally, a Random Forest model was used to assess the overall, or global, predictor importance (Saarela \u0026amp; Jauhiainen, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). Random Forest models often perform well for prediction tasks and, although they can be hard to interpret directly, they produce robust estimates of predictor importance. 
The model form was the same as for the conditional inference tree:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eResponse: log-transformed, aggregated grant amounts by topic\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ePredictors: log-size, log-patent citations, log-policy citations, log-clinical trials, author TTR, TIF, topic family size\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eA Pearson correlation test shows that the predicted responses from the Random Forest model are highly correlated with the response variable (\u003cem\u003er\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.91, \u003cem\u003et\u003c/em\u003e\u0026thinsp;=\u0026thinsp;74.416, df\u0026thinsp;=\u0026thinsp;1104, \u003cem\u003ep\u003c/em\u003e-value\u0026thinsp;\u0026lt;\u0026thinsp;.001). The predictor importance is discussed in the next section, where it is compared to the results from the other two models.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e4.5. Predictor importance\u003c/h2\u003e\u003cp\u003eTo assess predictor importance, the results from each of the three models were converted into ranks, where 1 is the most important according to model-specific criteria. For the conditional inference tree, ranks were assigned based on the level in the tree where the variable first appeared. Since log-size is the root node in the tree, it was assigned rank 1, whereas log-patent citations first appear nested under the root node, giving them rank 2. Only variables appearing in the first five levels of the tree were ranked; the others were coded as 0. For the linear mixed-effects regression model, ranks were assigned based on the size of the variable coefficient, with non-significant variables assigned a rank of 0. Finally, for the Random Forest model, the predictors were ranked by their contribution to the increase in node purity. 
To obtain a robust overall ranking, the median rank across the three models was calculated, as shown in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e below.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePredictor importance ranks from the Random Forest (RF) model, the conditional inference tree (CIT) model, and the linear mixed-effects regression (LMER) model, summarised in the form of a median rank across models. The most important variable is coded as 1, non-significant results are coded as 0.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePredictor\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRF rank\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCIT rank\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eLMER rank\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eMedian rank\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003elog_size\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003elog_patents\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003elog_policy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003etif\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eauthor_ttr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003elog_trials\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003en_l3_siblings\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe median rank across the models shows that log-size is consistently the most important predictor across the models, and that the socio-economic impact indicators (log-patent citations, log-policy citations) are consistently ranked highly, with the next tier made up 
of the academic indicators (citations in the form of TIF and author TTR), sharing the fourth place. Topic family size and log-clinical trials come in last and seem less important compared to the other predictors.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e4.6. Differences between predicted and observed funding levels\u003c/h2\u003e\u003cp\u003eWith this information, we can answer the following question: are some topics over- or underfunded, relative to what we would expect given the information we have about them? To test this, the linear mixed-effects regression model was used to predict funding amounts for the topics. This model was chosen because it had the best overall performance in terms of correlation between predicted and observed values. Once the predictions had been calculated, the absolute difference between the predicted and the actual (or observed) funding amounts was calculated. Since the distribution of differences was symmetric and centred on zero, the following approach was used to identify significant differences between observed and predicted values: the standard deviation for the differences was calculated and compared to the individual differences. Any difference between observed and predicted values that was greater than two standard deviations from the mean was taken to be a notable difference, as an approximation to statistical significance at the .05-level.\u003c/p\u003e\u003cp\u003eThis identified 26 topics whose funding was greater than expected (Appendix A), and 28 topics whose funding was lower than expected (Appendix B). That result shows that the overwhelming majority are funded in accordance with the parameter values of the model. 
For the small minority of topics that receive more funding than predicted by the model, we can see traces of the cost of research infrastructure (Particle Physics, Nuclear Physics), as well as traces of policy priorities (Mathematics and Numeracy Curriculum and Pedagogy, Geriatrics and Gerontology). For the topics that receive less funding than predicted, we find a mix of topics that range from law and the humanities (Art Criticism, Taxation Law) to STEM topics like Avionics, Electrometallurgy, and Food Engineering.\u003c/p\u003e\u003cp\u003eA natural question is whether this is simply a case of STEM topics receiving more funding than Humanities and Social Sciences topics. However, such a conclusion is not supported by the data. If the list of over- or underfunded topics reflected a systematic bias in favour of STEM topics, we would expect a statistically significant difference between the two lists in terms of which level 1 topics were over- or underfunded. To test this, an aggregated table was created with the counts of how many times a level 1 topic was either over- or underfunded (Appendix C). Since the expected counts in this table were all below 5, Fisher\u0026rsquo;s Exact Test for count data was used to test the association between level 1 topic and over- or underfunding. A two-sided test showed no statistically significant relationship between over- or underfunding and level 1 topic (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.869). This finding is expected, since the mixed-effects regression design explicitly accounts for variations in baseline funding for level 1 topics via the random intercept.\u003c/p\u003e\u003cp\u003eA possible interpretation of these results is that multiple processes contribute to the observed over- and underfunding patterns. 
While some topics might receive higher than expected funding rates due to policy considerations or high infrastructure costs, a complementary perspective is that the underfunded topics are highly efficient and \u0026ldquo;overperforming\u0026rdquo;, given that their parameters suggest a higher funding level than what is observed.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. Discussion and conclusion","content":"\u003cp\u003eThis study has explored how features associated with research topics can predict future grant amounts per topic, using a conditional inference tree, a Random Forest model, and a linear mixed-effects model, within a lagged panel-design study. All models fit the data well, with the mixed-effects model showing a slightly better performance than the other two. All three models agree that past topic size is of paramount importance for predicting future levels of funding, similar to what has been observed for a specific field like medicine (Vanderelst \u0026amp; Speybroeck, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2013\u003c/span\u003e). This is not unreasonable, since established, well-funded research topics will be able to produce more research, which in turn forms the basis of future grant applications and hence attracts more funding from funding agencies, leading to a form of Matthew effect (Bol et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2018\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eUsing a model-triangulation approach, it was found that across the three models, citations from patents and policy documents came second and third in importance, respectively. This could plausibly be interpreted to mean that considerations of socio-economic impact are influencing which topics receive funding. 
That interpretation is congruent with the findings in Vanderelst \u0026amp; Speybroeck (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2013\u003c/span\u003e), who find that medical research is largely funded in accordance with disease burden, albeit skewed towards the disease burden in rich countries. In joint fourth place, across the three models, came features related to academic impact and the research community, viz. the TIF variable and repeat authorship represented by the TTR variable. These results suggest that, all else being equal, highly cited topics are more successful at attracting funding. Similarly, the author type-token ratio result suggests that topics where we find more unique authors, relative to all authors who published in the topic, are rewarded with more funding. This could potentially be an effect of larger, better-funded topics having a larger contingent research community, composed of early-career researchers who publish once in the topic. Alternatively, a high author type-token ratio could point to a multi-disciplinary topic that occasionally draws on the expertise of researchers mostly publishing in other topics. The two explanations are not necessarily mutually exclusive, and more research is needed to distinguish between them. Conversely, topics with a lower author type-token ratio would, under either explanation, point to a more established research community publishing repeatedly on the topic, with fewer new entrants.\u003c/p\u003e\u003cp\u003eFinally, the number of clinical trials and the immediately surrounding topic landscape (in the form of the number of topic siblings in the KOS) came last with no or only negligible effects across the three models. The trials result could be due to patents and policy documents presenting a stronger signal that washes out any effect of a simple count of clinical trials. A more targeted classification of trial information might throw more light on this. 
The lack of effect for the number of sibling topics could imply that grants are sufficiently targeted and specific that related topics are not necessarily competing for funding, but further research would be needed to explore this relationship.\u003c/p\u003e\u003cp\u003eIt is important to underline that the effects above cannot simply be ascribed to differences between research fields. The linear mixed-effects model used a random intercept term for the top-level topics, such as \u003cem\u003eBiological Sciences\u003c/em\u003e, \u003cem\u003eChemical Sciences\u003c/em\u003e, \u003cem\u003eEconomics\u003c/em\u003e, and \u003cem\u003eHistory, Heritage and Archaeology\u003c/em\u003e. The results discussed here, in the form of the fixed effects (predictor variables), represent the influence these variables have, \u003cem\u003eover and above what can be explained by the random effect\u003c/em\u003e, i.e. the top-level topics.\u003c/p\u003e\u003cp\u003eIn conclusion, the results discussed in the present paper suggest that larger topics are rewarded with more research funding, and that signals of socio-economic impact (policy and patent citations to research) are important secondary lagged correlates of funding. The internal academic indicators (academic citations, author type-token ratio) play a comparatively lesser, but still significant, role. This points to a hierarchy of considerations driving funding, some of which can be modelled accurately. Overall, the modelling approach employed in the present paper suggests that most topics are funded largely in accordance with what their size and other parameters would predict. For the topics whose funding is above or below expected levels, we can see traces of both the baseline infrastructure costs of doing research and policy considerations around health and education. 
However, more research is needed to clarify the underlying dynamics of the topics marked as over- or underfunded in the present analysis.\u003c/p\u003e\u003cp\u003eFurther research will be required to explore these effects in more detail, and to determine how they play out across different time spans, funder types, and topic granularity levels, as well as to identify any geographical effects. Overall, the results in the present paper clearly suggest that investigating the relation between topics and research funding is a valuable complement to research on the effect of topics on individual funding applications. The topic-centric view presented here provides insights into the overarching structural patterns in the research funding landscape by identifying potential system-level drivers of research funding, pointing back both to structural effects arising from policy and strategy and to the dynamics of the research topics themselves. As such, the findings are well-positioned to inform decision-making on research policy and scientific funding. A better understanding of the structural factors at play could provide funders and policymakers with additional levers for adjusting funding strategies, as well as an evidence base for addressing topic-level funding inequalities such as those identified by Bol et al. (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) and Hoppe et al. (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). The \u0026ldquo;Matthew-effect\u0026rdquo; identified by Bol et al. 
(\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) for individual researchers is equally relevant at a macro-structural topic-level, where it has potentially far-reaching ramifications in terms of funding and its consequences for the socio-economic impact of research.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eOpen science practices\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe paper relies primarily on Dimensions data, which is free for research purposes. The dataset used for the statistical analysis is freely available for research purposes from Figshare, see Jenset (2025), https://doi.org/10.6084/m9.figshare.30740195.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study uses data from the Dimensions database. Dimensions is a product from Digital Science whose owner, the Holtzbrinck Publishing Group, is also one of the owners of Springer Nature, where the author is employed.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis paper is a revised version of a manuscript that was peer-reviewed and accepted for presentation at the 29\u003csup\u003eth\u003c/sup\u003e Annual STI-ENID Conference, held at the University of Bristol, UK, from 3-5 September 2025. Thanks are due to two anonymous STI-ENID reviewers who suggested improvements to the manuscript, as well as to the participants of the STI-ENID session for their questions and comments. This work was carried out within a broader project on measuring the socio-economic impact of research. 
The author is grateful to the project group members for feedback and discussion of early versions of this work: James Bayliss, Jack England, Daren Howell, Alison Mercer, Vera Nienaber, Roland Payton, In\u0026ecirc;s Pote, and Alex Rubleva.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAlperin JP, Portenoy J, Demes K, Larivi\u0026egrave;re V, Haustein S (2024) An analysis of the suitability of OpenAlex for bibliometric analyses. arXiv preprint arXiv :240417663\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBaayen RH (2001) Word frequency distributions. Springer\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBaayen RH, Shafaei-Bajestan E (2019) languageR: Analyzing linguistic data: A practical introduction to statistics. R package version 1.5.0. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://CRAN.R-project.org/package=languageR\u003c/span\u003e\u003cspan address=\"https://CRAN.R-project.org/package=languageR\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBates D, Maechler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using lme4. J Stat Softw 67(1):1\u0026ndash;48. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18637/jss.v067.i01\u003c/span\u003e\u003cspan address=\"10.18637/jss.v067.i01\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBol T, De Vaan M, van de Rijt A (2018) The Matthew effect in science funding. Proceedings of the National Academy of Sciences, 115(19), 4887\u0026ndash;4890\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDimensions (n.d.). Data sources. 
Retrieved April 9, 2025, from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://docs.dimensions.ai/dsl/data-sources.html\u003c/span\u003e\u003cspan address=\"https://docs.dimensions.ai/dsl/data-sources.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGyőrffy B, Herman P, Szab\u0026oacute; I (2020) Research funding: past performance is a stronger predictor of future scientific output than reviewer scores. J Informetrics 14(3):101050\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHeld M (2022) Know thy tools! Limits of popular algorithms used for topic reconstruction. Quant Sci Stud 3(4):1054\u0026ndash;1078\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHook DW, Porter SJ, Herzog C (2018) Dimensions: building context for search and evaluation. Front Res Metrics Analytics 3:23\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHoppe TA, Litovitz A, Willis KA, Meseroll RA, Perkins MJ, Hutchins BI, Davis AF, Lauer MS, Valantine HA, Anderson JM, Santangelo GM (2019) Topic choice contributes to the lower rate of NIH awards to African-American/black scientists. Sci Adv 5(10):eaaw7238\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: A conditional inference framework. J Comput Graphical Stat 15(3):651\u0026ndash;674\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJenset GB (2025) Data from Predicting future grant amounts using topic-level features. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.6084/m9.figshare.30740195.v1\u003c/span\u003e\u003cspan address=\"10.6084/m9.figshare.30740195.v1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJenset GB, Bevan P, Jain A (2025) A large-scale, granular topic classification system for scientific documents. Research Square preprint. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.21203/rs.3.rs-6529718/v1\u003c/span\u003e\u003cspan address=\"10.21203/rs.3.rs-6529718/v1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMcGillivray B, Jenset GB, Salama K, Schut D (2022) Investigating patterns of change, stability, and interaction among scientific disciplines using embeddings. Humanit Social Sci Commun 9(1):1\u0026ndash;15\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNewson R, Rychetnik L, King L, Milat A, Bauman A (2018) Does citation matter? Research citation in policy documents as an indicator of research impact\u0026ndash;an Australian obesity policy case-study. Health Res Policy Syst 16(1):55\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOvseiko PV, Oancea A, Buchan AM (2012) Assessing research impact in academic clinical medicine: a study using Research Excellence Framework pilot impact indicators. BMC Health Serv Res 12(1):478\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePorter SJ, Hawizy L, Hook DW (2023) Recategorising research: Mapping from FoR 2008 to FoR 2020 in Dimensions. Quant Sci Stud 4(1):127\u0026ndash;143\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRavenscroft J, Liakata M, Clare A, Duma D (2017) Measuring scientific impact beyond academia: An assessment of existing impact metrics and proposed improvements. 
PLoS ONE, 12(3), e0173152\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eReed MS, Ferr\u0026eacute; M, Martin-Ortega J, Blanche R, Lawford-Rolfe R, Dallimer M, Holden J (2021) Evaluating impact from research: A methodological framework. Res Policy 50(4):104147\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSaarela M, Jauhiainen S (2021) Comparison of feature importance measures as explanations for classification models. SN Appl Sci 3(2):272\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSikimić V, Radovanović S (2022) Machine learning in scientific grant review: algorithmically predicting project efficiency in high energy physics. Eur J Philos Sci 12(3):50\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTagliamonte SA, Baayen RH (2012) Models, forests, and trees of York English: Was/were variation as a case study for statistical practice. Lang Variation Change 24(2):135\u0026ndash;178\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eVanderelst D, Speybroeck N (2013) Scientometrics reveals funding priorities in medical research policy. J Informetrics 7(1):240\u0026ndash;247\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eVelayos-Ortega G, L\u0026oacute;pez-Carre\u0026ntilde;o R (2023) Indicators for measuring the impact of scientific citations in patents. World Patent Inf 72:102171\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYu H, Murat B, Li J, Li L (2023) How can policy document mentions to scholarly papers be interpreted? An analysis of the underlying mentioning process. Scientometrics 128(11):6247\u0026ndash;6266\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZeng A, Shen Z, Zhou J, Fan Y, Di Z, Wang Y, Stanley HE, Havlin S (2019) Increasing trend of scientists to switch between topics. 
Nat Commun 10(1):3439\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhu X, Turney P, Lemire D, Vellino A (2015) Measuring academic influence: Not all citations are equal. J Association Inform Sci Technol 66(2):408\u0026ndash;427\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Springer Nature (United Kingdom)","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"scientific topics, scientometrics, research funding, science of science","lastPublishedDoi":"10.21203/rs.3.rs-8248518/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8248518/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis study addresses research funding allocation, in the form of a large-scale empirical investigation into the structural factors, at the topic-level, that predict funding levels in the form of aggregated grant amounts, using a lagged panel-data approach. The topic-centric focus sets this work apart from previous research on which features predict success for individual grant applications. Understanding the topic-level dynamics around research funding provides a crucial complement to the individual-grant perspective, with a potential for informing research strategy and scientific priorities. Employing a data-driven approach based on large-scale data across more than 1,100 topics covering over 130 million publications, the study demonstrates that in addition to topic size, signals of socio-economic impact in the form of links to patents and policy documents, as well as citation patterns and aspects of the researcher community active in the topic, are significant predictors of future funding levels. A model triangulation approach, combining conditional inference trees, linear mixed effects regression, and Random Forest, reveals a clear hierarchy of predictor importance across different statistical models. 
Clear signals of socio-economic impact driving research funding emerge both from this hierarchy of effects, as well as an outlier analysis. Taken together, the results provide compelling evidence for the structural, topic-level traces of socio-economic impact that influence research grants.\u003c/p\u003e","manuscriptTitle":"Predicting future grant amounts using topic-level features","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-02 09:09:08","doi":"10.21203/rs.3.rs-8248518/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c5d17a9b-4b74-477a-9d24-dae5ff09acc9","owner":[],"postedDate":"December 2nd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":58868161,"name":"Information Retrieval and Management"}],"tags":[],"updatedAt":"2025-12-02T09:09:08+00:00","versionOfRecord":[],"versionCreatedAt":"2025-12-02 09:09:08","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8248518","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8248518","identity":"rs-8248518","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}