Scientific Committee guidance on appraising and integrating evidence from epidemiological studies for use in EFSA's scientific assessments.

OA: gold CC-BY-ND-4.0
Full text 288,167 characters · extracted from pmc-nxml · 10 sections · click to expand

Data

The concepts used for the development of this guidance are covered in standard textbooks in human (Lash et al.,  2021 ), animal (Dohoo et al.,  2009 ) and plant disease (Cooke et al.,  2006 ; Madden et al.,  2007 ) epidemiology, as well as textbooks covering more focused topics such as nutritional epidemiology (Willett,  2012 ). Published papers, book chapters and reports from the biomedical literature are referred to where appropriate in support of arguments, statements and examples. Concerning methodology, this guidance is drawing on basic concepts and methodologies within epidemiology, explains them and provides recommendations on how to use them in the context of EFSA's work. ToR 3 demands that the guidance addresses the specific scientific assessment questions of the different EFSA panels. Therefore, the experience of the different scientific panels of EFSA in using epidemiological studies in their scientific assessments, and the specific questions for which guidance was needed, were collected via a questionnaire submitted to the EFSA coordinators of all 10 panels and their chairs. The responses were used to refine the guidance.

Panel

Vasileios Bampidis, Diane Benford, Claude Bragard, Thorhallur I. Halldorsson, Antonio Hernandez‐Jerez, Susanne Hougaard Bennekou, Konstantinos Koutsoumanis, Claude Lambré, Kyriaki Machera, Wim Mennes, Ewen Mullins, Simon More, Soren Saxmose Nielsen, Josef Schlatter, Dieter Schrenk, Dominique Turck and Maged Younes.

Audience

The aim of this guidance is to facilitate multidisciplinary and integrative scientific assessments, in particular to facilitate better integration of epidemiological and toxicological data. The guidance provides a harmonised, but flexible framework that is applicable to all areas of EFSA's work and all types of scientific assessment, including risk assessment. In line with improving transparency, the Scientific Committee considers the application of this guidance to be unconditional for EFSA. Assessors have the flexibility to choose appropriate methods and the degree of refinement in applying them.

Guidance

EFSA has published in recent years a number of cross‐cutting guidance documents with the aim of further increasing robustness, transparency and openness of its scientific assessments . Altogether the documents cover major approaches to the use and interpretation of data and scientific evidence in risk assessments. In the PROMETHEUS project (‘PROmoting METHods for Evidence Use in Scientific assessments’), EFSA defined a set of principles for the scientific assessment process and a 4‐step approach (plan/carry out/verify/report) for their fulfilment (EFSA,  2015 ), which was piloted in 10 case studies, one from each EFSA panel. According to PROMETHEUS, the process of validating or appraising evidence must be planned for, conducted consistently, verified and thoroughly documented. To do so, pre‐defined criteria must be applied to all individual studies of the same design included in the assessment. This is important as study appraisal both informs and influences the integration process, where all potentially relevant data are considered and weighted together. Based on the results from the pilot phase, several limitations were identified, and recommendations were made (EFSA,  2018 ). These included: the lack of guidance and of agreed in‐house appraisal tools; the need for standardised templates that account for the diversity of the evidence; the lack of expertise in appraising studies using structured approaches; the need for multidisciplinary working groups (WGs) of experts (statisticians, epidemiologists, domain experts). the lack of guidance and of agreed in‐house appraisal tools; the need for standardised templates that account for the diversity of the evidence; the lack of expertise in appraising studies using structured approaches; the need for multidisciplinary working groups (WGs) of experts (statisticians, epidemiologists, domain experts). In the ‘Guidance on the assessment of the biological relevance of data in scientific assessments’ (EFSA Scientific Committee,  2017 ), biological relevance is considered at three main stages of the process of dealing with evidence. In that document, it is stated that 'For each effect, the first step is to determine whether it is causally related to the exposure or treatment, for instance according to the Bradford Hill viewpoints (Hill,  1965 )'. Therefore, even if aspects related to the reliability of the various pieces of evidence used in the assessment are outside the scope of this guidance document, evidence appraisal is acknowledged as being a necessary step to reach conclusions about exposure–health associations. The Weight of Evidence (WoE) guidance document (EFSA Scientific Committee,  2017 ) provides a general framework for considering and documenting the approach used to evaluate and weigh the assembled evidence when answering the main question of each scientific assessment. This includes assessing the relevance, reliability and consistency of the evidence. Lastly, the guidance on protocol development for EFSA generic scientific assessment (EFSA Scientific Committee,  2023a ) lays out a harmonised and flexible framework for developing protocols that consists of two main steps. In the first step, problem formulation, the APRIO (Agent, Pathway, Receptor, Intervention and Output) approach is introduced to translate the ToR into assessment questions. The APRIO approach aims to bridge the challenge in applying the PICO/PECO (Population, Intervention/Exposure, Comparator, Outcome) approach to the EFSA remit. In the second step, protocol development, the evidence needs and the methods that will be applied for answering the assessment questions, including uncertainty analysis, need to be specified. In line with the concepts and approaches set out in these guidance documents, this document can be considered an addition that addresses needs for specific guidance on appraisal and integration of evidence from epidemiological studies (Sections  4.3 and 4.4 ). In recent decades, principles and methodology of epidemiology have undergone rapid development. This development has partly been driven by advancements in methods for collecting and analysing large scale data. Several definitions of epidemiology have been proposed, with older definitions often being narrower in scope, stating for example that 'Epidemiology is concerned with the patterns of disease occurrence in human populations and the factors that influence these patterns' (Lilienfeld & Lilienfeld,  1980 ). Acknowledging the broader scope of epidemiological research today, Porta in the Dictionary of Epidemiology ( 2014 ) defined epidemiology as: The study of the occurrence and distribution of health‐related events, states and processes in specified populations, including the study of the determinants influencing such processes, and the application of this knowledge to control relevant health problems. The study of the occurrence and distribution of health‐related events, states and processes in specified populations, including the study of the determinants influencing such processes, and the application of this knowledge to control relevant health problems. By this definition, epidemiology is not just the study of disease 1 but also the study of any health‐related endpoints/outcomes, 2 including risk factors, surrogate outcomes, and biomarkers of exposure/effect. Indeed, epidemiology provides a set of tools and methodologies to describe outcomes of interest in defined populations. Such outcomes can be of a variety of types (e.g. infection, disease, immunity to specific diseases, presence of certain conditions such as raised blood pressure or blood lipids, hard disease endpoints such as stroke, cancer occurrence). Epidemiological studies may also cover the occurrence of outcomes in members of a population where a direct link with health has not been well characterised so far. The growing number of studies in humans and animals linking environmental exposures to the composition of gut microbiota (Clemente et al.,  2012 ; Lee & Hase,  2014 ; Snedeker & Hay,  2012 ; Wolter et al.,  2021 ) is one example of such studies. The Dictionary of Epidemiology definition is relevant for the study of ‘…health‐related events, states and processes…’ in any population, these being either humans, animals or plants. Regardless of the type of populations under study, these populations need to be defined explicitly. Although there can be important differences in design and conduct, it follows that general epidemiological principles and considerations also apply in settings, which have traditionally been viewed as non‐epidemiological. As an example, potential biases that can occur in studies in laboratory animals are often the same as those encountered in experimental studies in humans. Such similarities in methodology generally exist across different fields of epidemiology (animals, humans). Although various classifications exist (Lesko et al.,  2020 ), epidemiological studies can be broadly classified as either descriptive or analytical. In descriptive studies, patterns of exposures or outcomes of interest are described across one or more factors, such as over time and place, while in analytical studies, relationships between identifiable factors and outcomes of interest are quantified. Analytical studies can be classified as either experimental or non‐experimental studies, with the latter often referred to as observational studies. Descriptive epidemiological studies have the objective of describing and/or comparing the occurrence of exposure or outcome in a population (e.g. humans, livestock or companion animals, plants) over factors such as time and space. When all members of a defined population can be examined, i.e. when a census is possible, the characteristic(s) of interest can be determined directly. In practice, this is often not possible, for logistical or other reasons. In those cases, surveys need to be conducted, in which a sample is taken from the population of interest and their characteristics are then measured. These include prevalence or surveillance surveys of specific characteristics. The value of the estimates resulting from such surveys also depends on the representativeness of the sampling in relation to the scope of the survey. Examples of such studies of relevance for EFSA are The European Union One Health 2022 Zoonoses Report (EFSA and ECDC,  2023a ); reports on antimicrobial resistance (EFSA and ECDC,  2023b ); surveys for plant harmful organisms ('pests') relevant to the EU's plant health policy for which EFSA provides survey data sheets (EFSA,  2020 ) and reports on pesticide residues in food (EFSA,  2020 ) and dietary surveys (Ioannidou et al.,  2021 ). A brief description of the most common designs of analytical epidemiological studies, both experimental and observational, is given in this section. Different designs exist due to their abilities to extract information under varying experimental or non‐experimental (observational settings) for health outcomes of different frequency, severity and with varying latency periods. Experimental studies (also named ‘intervention studies’ or ‘trials’) are primarily confined to experiments where the exposure conditions are controlled by the researcher to examine what effect an intervention may have on the population under study. In Randomised Controlled Trials (RCTs) , factors that may affect the outcome are (on average) balanced out by randomly allocating study participants to different treatments (two or more groups). At the end of the experiment, the groups are then compared with respect to the outcome of interest (parallel design). The unit of randomisation can either be the individual or a group of individuals within the study population (cluster randomisation). Examples of clusters are school units, families and neighbourhoods. Cluster randomisation may be chosen to serve convenience, to overcome ethical concerns raised by individual randomisation, to avoid departures from treatment or to assess group effects. Examples of experimental studies of randomised design include pharmaceutical trials, trials with foods, including dietary supplements, and changes in dietary patterns and toxicological studies in experimental animals. 3 The sample size needs to be sufficiently large to allow matched/stratified analyses and to better control for confounders, and to ensure a sufficient precision to the risk (or effect) estimates that are expected to be generated by the study (Rothman & Greenland,  2018 ). In humans, variability in lifestyle and genetic factors is generally high. As a result, larger sample sizes are needed, as compared to experimental studies in laboratory animals. In RCTs, random treatment allocation alone is, however, not sufficient to achieve unbiased results. Blinding the investigators and caretakers (e.g. in the case of children and animals) to treatment assignment and outcome detection and assessment is essential to avoid bias resulting from unintended differences in co‐intervention of the experimental groups or differences in the assessment of the outcome. For participants, being blinded to the treatment received is equally important to avoid bias from selective dropout, changes in behaviour or departures 4 from the assigned treatment (Dodd et al.,  2011 , 2012 ). When both investigators and study participants are blinded to treatment, these studies are traditionally referred to as double blind RCTs . If appropriately designed and conducted, they are expected to provide an unbiased measure of effect (gold standard). Several variants of the RCT design exist. The simplest variant is when double blinding cannot be achieved. This is the case for many interventions, such as those based on nutritional and lifestyle changes, that test the efficacy of treatments when the exposure of interest cannot be masked. For example, for many foods and dietary components, such as fish oils or different types of sweeteners, participant blinding is difficult or impossible to achieve, while blinding of the investigators can still sometimes be ensured. Similar problems arise in many cognitive and physical activity interventions. Lack of blinding at the participant level may lead to selective dropout and differential departures from the assigned treatment, outcome detection bias, or confounding. Another design is the crossover trial (either randomised or not) where each participant receives both (or all) treatments in a sequential order, with a suitable ‘wash out’ period in between. This design has the advantage that each participant acts as its own control, which is more effective in balancing external factors than comparing different participants randomly allocated to two or more treatment groups. The limitation of this design is that it is only suitable for treatments where the anticipated effects are short term and fully reversible. That is, no carry‐over effects between treatments are expected to occur, and response to treatment can be assumed to be independent of the order in which it is assigned. This design is often used when comparing the short‐term effects of different treatments on clinical biomarkers, such as blood pressure or blood lipids. A variant of the crossover design in occupational settings is when the researcher changes workers' exposure by removing them temporarily (or permanently) from their workplace (or assigning them to other tasks), to see if their health conditions (such as asthma or allergies) improve (partial crossover design). Other types of intervention studies that are relevant for the area of food safety are the so‐called Phase 0, I and II clinical trials . These trials are conducted to assess safety, pharmacokinetics and ‐dynamics of pharmaceuticals in humans prior to conducting larger scale RCTs (Phase III trials). One characteristic of their design is that they may not include a well‐defined control group for comparison and the study population can be quite different from the target population that the intended treatment is designed for. In Phase 0 trials , a group of healthy participants are given micro‐doses of the test substance. Such trials are aimed at detecting potential adverse effect at low doses and/or provide relevant information on pharmacokinetics. General conclusions on the effect of the exposure are, however, hampered as this design does not include a control group. Further, healthy participants may be less likely to respond to treatment compared to more sensitive individuals. In Phase I trials , often called ‘dose escalation trials’, participants are dosed in small groups going from low to high doses to assess the safety or tolerability of the test substance. These studies may involve sensitive sub‐groups (patients) and they provide valuable information on both pharmacokinetics and tolerability. However, in terms of evaluating the potential health effects across dose groups, the small number of participants per dose and possible dropout due to adverse events means that randomisation across dose groups is variably successful. As a result, bias (confounding) may occur. Phase II trials are designed to test therapeutic doses of the test substance often in sensitive individuals. These trials are of smaller scale (sample size) than Phase III trials and vary by design in terms of use of controls (currently preferred medication or placebo). Although Phase 0, I and II trials are mostly used for pharmaceuticals, these designs (or variants of these designs) may cover exposures falling under EFSA remit. An example of such studies includes a Phase 0 trial studying the pharmacokinetics of bisphenol A (Völkel et al.,  2002 ), a Phase I dose escalation trial of caffeine (Altman et al.,  2011 ) and advantame (Warrington et al.,  2011 ) and in the area of novel foods a Phase II trial examining the possible therapeutic effects of flavanol‐containing cocoa (Balzer et al.,  2008 ). Experimental studies involve a variety of ethical considerations . An extensive and rigid framework of ethical standards is in place, and it is constantly evolving based on new developments and their challenges. These ethical standards aim to safeguard the participants' safety, autonomy, and equal and respectful treatment within the experimental study (World Medical Association,  2013 ). Even when no apparent harm (side effect) is expected, such as in preventive interventions, the design of the study should ensure the best interest of the participants, including active surveillance for unexpected adverse events. In several cases, experimental studies aimed at testing beneficial effects of presumably non‐harmful doses of nutritional substances at low doses, such as micronutrients and other food supplements, have shown unexpected harmful effects (Blumberg & Block,  1994 , 5 ; Lippman et al.,  2009 , 6 ; Kristal et al.,  2014 , 7 ). These examples clearly highlight the importance of being cautious and maintaining high ethical standards when conducting experimental studies. In non‐experimental (observational) epidemiological studies, the researcher does not control the circumstances or the amount of exposure. Instead, the researcher observes the outcome of interest in a given population, whose members may have been exposed to certain factors, inadvertently or by choice. The exposure of interest is observed (and quantified, where possible) and its relationship with the studied outcome assessed. The level and variation of the observed exposure reflect how participants have been exposed within their surroundings, which includes occupation and differences in dietary habits and other factors. Associations between exposures and outcomes are identified from such studies, but it needs to be ascertained whether the observed associations are attributed correctly to the exposure of interest. In fact, confounding may occur if other determinants of the outcome are not randomly associated with the exposure. For example, a study may find that elderly people with serum 25(OH)D (vitamin D) above 75 mmol/L perform better on physical function tests than those with vitamin D status below 50 mmol/L. Such an association may be confounded by the simple fact that those participants who are healthier (less frail) may spend more time outdoors, and therefore have higher serum vitamin D level, at least partly due to their exposure to the sun. If the observed association between physical function tests and level of vitamin D is attributed to vitamin D, confounding may occur. In practice, it is usually impossible to record and fully account for all factors that may influence the outcome. However, a possible replication of findings across different study populations with support from other experimental findings in vivo and/or in vitro, a low risk of bias (RoB) in such studies, and biological plausibility of the observed association between exposure and outcome(s) may support a stronger case for or against causality. Caution should be taken that the same biases may consistently exist in different studies across varying populations. The main observational epidemiological study types are cohort, case–control, cross‐sectional and ecological studies. These designs differ mainly in terms of selection of study participants, the timing between assessment of exposure and the outcome; and whether one or the other is assessed on an individual or group level. In cohort study designs, 8 a source population is defined, and participants (the study population) are classified according to their exposure(s). Participants are then ‘followed‐up’ for a specified period of time (the risk period), during which the outcome of interest is evaluated and compared across the exposure groups, while taking potential confounding factors into consideration. One advantage of this design is that it can be ascertained at the beginning of the study whether participants are free of the outcome of interest without the risk of differential misclassification depending on the outcome status. After a follow‐up time considered sufficient to cover the known or assumed induction period of the outcome (or disease), it can then be examined if the exposure may have contributed to the development of the outcome. Frequently used variants of cohort studies are the case–cohort and cohort‐nested case–control studies. These are generally more compact designs requiring smaller number of study participants and are often used for efficiency reasons, for example when chemical analyses or clinical assessment cannot be performed for all cohort participants (O'Brien et al.,  2022 ). In general, cohort studies are more resource demanding and difficult to conduct than other types of epidemiological studies, and the time it takes to generate results depends on the induction period of the outcome. Several large cohorts have been created to address many different exposure – outcome associations, including rare diseases, that are studied over time (e.g. the EPIC project, 9 the Danish National Birth Cohort, 10 the Avon Longitudinal Study of Parents and Children, 11 UK Biobank 12 ). Epidemiological studies frequently distinguish their findings among those related to the primary endpoint(s) or hypotheses. Different endpoints, related to additional objectives, are often added over time to the original study protocol, and this applies to both observational studies as well as experimental studies. In addition, studies presenting numerous disease outcomes may or may not adjust the presented p ‐values for multiple testing when different hypothesis are tested within the same study. Such adjustments are, however, generally not performed to account for different hypotheses presented in different studies. In general, a distinction between primary and secondary study hypotheses in terms of internal validity is useful in interpreting effect estimates (or published p ‐values). When primary and secondary endpoints are inter‐related, they may re‐enforce each other in terms of biological plausibility. Both types of endpoints are worth considering, particularly for assessing findings across studies. In such cases, the distinction between primary or secondary endpoints is less relevant. Cohort studies can be prospective, historical (retrospective) or a combination of both . In prospective cohort studies, information on exposures is collected prior to assessment of the outcome while for historical studies the exposure and/or the outcome are assessed back in time (retrospectively). However, even a historical cohort study may involve exposure information that was recorded prospectively, e.g. a historical occupational cohort study may involve following participants over several years from recruitment, but the exposure information may be based on archived blood samples, clinical or other records collected prior to recruitment at the time that the relevant exposures occurred. In historical cohort studies, the exposure has already occurred before the study, but the outcome has yet to occur. This is a useful setup for assessing health outcomes with a long induction period and exposures that could trigger several outcomes of interest (Lazcano et al.,  2019 ). Case–control studies recruit participants based on the outcome of interest. That is, participants with a certain disease or health state (cases) and an appropriate group of participants that do not have such condition at the time of enrolment (controls) are recruited from the same source population. Thus, a case–control study involves studying cases (from a specific source population) and a sample of non‐cases (ideally from the same source population). The distribution of past or current exposures among cases and non‐cases (controls) is then compared, adjusting for confounding factors. Further details on the different types of case–control studies can be found in the paper of Knol et al. ( 2008 ). It is important that selection of controls is conducted at random, i.e. that controls are a random sample of the source population over the risk period, with the qualification that controls may be matched to cases on some key factors such as age and gender. The strength of this design is its efficiency compared to the cohort design. In fact, case–control studies should be viewed in the context of a specific source population, in the sense that all cases from this population or a representative sample of these – over a defined period of time – are included in the study, and only a sample of non‐cases is taken from the population. This sample of non‐cases is used to estimate the distribution of exposures and confounders of interest in the source population from which the cases arose. The gain in efficiency of the case–control design derives from the fact that in a cohort study of the same source population, the entire population would have been studied. This gain is even more pronounced if the outcome under study is rare. Similarly, case–control studies may be based on historical records or may involve interviews about historical or current exposures. The latter approach may result in problems if the health condition influences quantification of current or past exposures. For example, cancer cases may recall their past exposure differently than non‐cases, even in situations when the exposure being assessed is not causally related to their disease condition (the same holds for many other health conditions). In addition, differences in the presence of certain health conditions among cases and controls, such as impaired kidney function or inflammation, can influence the measured concentrations for many biomarkers of exposure. In summary, the presence of certain health conditions when exposure is being assessed can create a spurious correlation between the quantified exposure and the health outcome under consideration, e.g. reverse causation. Assessing the exposure prospectively prior to the occurrence of the outcome should reduce the risk of such spurious correlation. It is a common misunderstanding that case–control studies are always of lower value compared to cohort studies or randomised trials due to increased RoB. Past exposures can often be accurately assessed retrospectively through archived biomaterials stored in biobanks, or through access to high‐quality health records or other similar sources, e.g. registries. If past exposures can be assessed in such manner, with appropriate temporal separation in relation to the outcome assessment, the RoB due to the exposure assessment should be comparable to that of a prospective design. Thus, for case–control studies the RoB is largely determined by (differential and non‐differential) exposure misclassification, i.e. by how and when the exposure was assessed (retrospectively based on participant recall, cross‐sectional or assessment of past exposures from high‐quality records) and selection bias. Other types of observational epidemiological study designs include the cross‐sectional design and the ecological study design. In cross‐sectional studies , a group of participants is recruited at one specific point in time, and information on both outcome and exposure is ascertained simultaneously. By design, it is often not possible to ascertain whether the exposure occurred before the outcome; therefore, the directionality of the observed association is often uncertain. That is, in some cases, the outcome (health state) itself, directly or through behavioural changes, may influence the parameter being assessed as exposure, as a result of reverse causation. The risk of such bias strongly depends on the time period that the measured exposure reflects and the health outcome under consideration. This has to be evaluated on a study‐by‐study basis. Still, a cross‐sectional design may often be appropriate, such as in cases when exposure has short‐term effects or for hypothesis generation. For example, for relatively rare exposures such as consumption of glycyrrhetinic acid from liquorice, which affects blood pressure (Sigurjónsdóttir et al.,  2001 ), a simple cross‐sectional study recording consumption for the past day and measuring blood pressure at the same time would be more appropriate than a cohort design that prospectively correlates exposures recorded in the previous year to current blood pressure. Additionally, no problems with reverse causation would exist in cross‐sectional studies, for risk factors that do not change (e.g. blood type, genetic factors, etc.). Finally, in ecological studies, 13 the units of observation are groups of participants defined, for example by region or community. Health‐related states and exposures are measured, for example by rates in geographical areas, and their relation is examined. Limitations of these studies are lack of individual assessment and difficulty in accounting for confounders on a group level. Since the exact status of each member of the population (either in terms of exposure or in terms of outcome, or both) cannot be ascertained, the ‘ecologic fallacy’ may be produced. That is, the relationship between averages of population exposures and outcomes may not represent the relationship between exposure and outcomes at the individual level. However, despite this limitation, ecological studies sometimes have the advantage of achieving large exposure gradients, as exposure to certain nutrients or contaminants is generally greater across different units of observation than within individual units (Willett,  2012 ). In summary, each of the observational study designs reviewed above has its strengths and limitations. Despite case–control and particularly cohort studies being generally considered to provide a higher certainty of evidence, for certain exposures and outcomes also cross‐sectional and ecological studies can provide valuable information for risk assessment, complementing other lines of evidence. The basic principles of design and analysis of epidemiological studies are the same for human, animal and plant populations. Veterinary epidemiology, although based on the same methodological and study design principles as human epidemiology, often has different challenges to address, while some aspects of the execution of epidemiological studies may be simpler in livestock or companion animals rather than in human populations. For example, compared to humans, animal populations are sometimes easier to access, observe, control, test and follow‐up. On the other hand, not all animals are individually identifiable, as is the case in intensively reared chicken, fish or wildlife. In those cases, probability sampling of populations and formation of study groups of individuals can prove very challenging or impossible. Additionally, exposures, outcomes and confounders may not be possible to assess at the individual level. In those cases, it may be necessary to use an entire group of animals (population of an entire fish tank, or an entire room of broilers, etc.) as the unit of the epidemiological study (in which case the exposures, outcomes and confounders are assessed at the group level). Sometimes, it may be feasible to introduce manipulations that make individual identification of animals possible, but this may not necessarily be part of the usual routine of animal rearing, and therefore it could affect the study findings. As in any other branch of epidemiology, the definition of the target population, study population, enrolment process and sampling when dealing with animal populations is made with regard to the study objective, feasibility and bias minimisation. Studies on companion animals can have more similarities with human studies than studies on farm animals or wildlife. The hierarchical structure of farmed animal populations (e.g. different levels of organisation and possible social structures of such populations, clustering within production or housing units, litter, etc.) requires consideration in the design of the study and use of appropriate statistical methodology when analysing its results. Studies on wildlife are typically restricted to descriptive or cross‐sectional designs. Unique challenges exist on ascertainment of cases in wildlife studies when the entire population is not easily accessible. This is because observation of animals with the condition under study can, in those cases, be very challenging or dependent on other factors. For example, sick or dead wild animals may not be found, unless they are close to routes of human movements without ever being observed. The estimation of population sizes in these cases is a study objective in its own and requires specific methods (e.g. using capture–recapture). Information on population size is required as a denominator in measures of disease frequency. Population size, on the other hand, is usually not a challenge in farmed animal production systems (except sometimes when entire production units, or even entire farms, are the epidemiological unit). In all cases, daily operation of the system and the planning of the production need to be taken into consideration when conducting the study. In veterinary epidemiology, obtaining exposure, disease and confounder information needs to focus on animal owners, breeders or farmers or on records or proxy measurements; therefore, the reliability of these sources of information always needs to be assessed. Distortions due to human behavioural or cognitive factors (compliance, non‐response, recall and other intentional or non‐intentional interferences with sampling, treatment or diagnosis) may still occur and, therefore, influence exposure or outcome assessments, treatment of study animals and other aspects of the epidemiological study. In plant health, an epidemic has been defined simply as 'the change in intensity of a disease in time and space' (Madden et al.,  2007 ). Plant health focuses mainly on infectious disease (rather than non‐communicable disease). A considerable number of plant health threats are caused by the invasion and spread of herbivorous insect populations in addition to pathogenic microorganisms, and the EFSA Plant Health (PLH) Panel thus operates at the intersection of epidemiology and population ecology. Consequently, fields of study relevant to the PLH Panel can be found in the study of infectious disease of humans and animals (e.g. Diekmann & Heesterbeek,  2000 ), and invasive species and entomology (e.g. Cock & Wittenberg,  2001 ). There is also a very strong focus on the environmental drivers of insect pest and pathogen populations in plant health, which are a major contributing factor to epidemics. Indeed, plant pest risk is usually viewed through the lens of the 'disease triangle' where there must be overlapping availability of host, pathogen and conducive environmental conditions for an epidemic to occur, with particular emphasis on the latter (Madden et al.,  2007 ). In contrast to human and, to a lesser extent, animal disease epidemiology, plant health is concerned with a very large number of wild and domesticated host species. For example, Xylella fastidiosa , a current major plant health threat in the EU, is known to infect over 696 plant species (EFSA,  2023b ). Despite this difference, the One Health concept, which has been used to unify human, animal, plant and environmental studies, has been identified as an opportunity to better integrate plant health (Boa et al.,  2015 ) with a few examples of broadening the approaches commonly used in plant health (e.g. Rizzo et al.,  2021 ). Integrated Pest Management (IPM) and especially the agroecological approach could perfectly fit into the One Health approach, allowing to achieve food safety and food security, while reducing the impact on the environment. The EFSA PLH Panel uses a mechanistic population‐based approach to capture the dynamics of insect pest and pathogen populations through the attributes of the disease triangle. This involves the definition of a conceptual model to compute changes in the population abundance and distribution across the different assessment steps (EFSA PLH Panel,  2018 ). For typical quantitative pest risk assessments (QPRA), questions are framed by the ISPM (International Standards for Phytosanitary Measures), in particular ISPM2 11 on entry, establishment, spread and impact of pest populations. These activities are supported by up‐to‐date panel guidance documents (EFSA PLH Panel,  2018 , 2019 ). Problems encountered include the availability of data to parameterise pest risk models (but which can be supported by Expert Knowledge Elicitation (EKE)) as well as transferability of models in space and time, including the assessment of climate suitability and climate change. In general, this is exacerbated by the limited number of epidemiological studies in plant health from which to synthesise information. Though this uncertainty is in part offset, since small deviations in risk can in general be tolerated in plant health, which is often not the case in human disease. In simple terms, causality is the process where one factor leads to the production of another process or state. Section  4.1.5.1 gives a short description of some of the existing theoretical frameworks on causality that have been developed within epidemiology. Considerations on how to make inferences of causality based on different study designs are then given in Sections  4.1.5.2 and 4.1.5.3 . It should be noted that in general, the level that a study is aimed at (e.g. molecular, individual, population) needs to be considered when weighing the evidence for causation. Much of the theoretical framework for causality in epidemiological studies has been developed in the 20th century, driven in part by studies on smoking and lung cancer (Lash et al.,  2021 ; Vandenbroucke et al.,  2016 ). Theoretical frameworks include the simple but much cited viewpoints formulated by Austin Bradford Hill ( 1965 ), and more elaborate theoretical frameworks such as the Sufficient‐Component Cause Model (Lash et al.,  2021 ). The subject of causality has also been elaborated by Pearl ( 2009 ) and Pearl and Mackenzie ( 2018 ). Moreover, well‐defined counterfactual conditionals can be used in causal reasoning as valuable tools for forming intermediate steps towards supporting causal claims (Hernán & Robins,  2020 ). In general, finding a statistical association in epidemiological studies with observational design is not per se enough to assume an association is causal. The Bradford Hill ( 1965 ) paper has been very influential in the development of systematic assessment of evidence of causality. With his nine viewpoints, Hill laid a sound framework for assessing causality; some are specific to assessing an individual epidemiological paper, but most are directed at synthesising evidence across different types of studies. The features that he proposed were as follows: Strength of the observed association, consistency across repeated studies, specificity of the association, temporality – exposure preceding effect, a gradient of effect or dose–response relationship, biological plausibility – mechanistic evidence or support from animal studies, coherence between different types of epidemiological observations – the observed association should not contradict any previous knowledge available about the disease and/or exposure, experimental evidence, analogy with comparable causal associations with other exposures. He emphasised that his systematic approach serves to guide the assessment of the strength of evidence of causality and cannot be used mechanically to yield a yes/no decision. None of my nine viewpoints can bring indisputable evidence for or against the cause‐and‐effect hypothesis and none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect? (Hill,  1965 ). None of my nine viewpoints can bring indisputable evidence for or against the cause‐and‐effect hypothesis and none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect? (Hill,  1965 ). The nine viewpoints and the questions to answer when assessing them are listed in Appendix  A of this document. The original Bradford Hill viewpoints have been modified and adapted for toxicology (Adami et al.,  2011 ), and applied within the Mode of Action framework for comparative analysis of the weight of evidence (WoE) (Meek et al.,  2014 ). A useful tool for characterising biological plausibility for a toxicological exposure/disease association is the 'Adverse Outcome Pathway' (AOP) approach. An AOP is an analytical construct describing the sequential chain of causally linked events at different levels of biological organisation that lead to an adverse effect. These should be included, if available, in the hazard assessment for exposure. The Bradford Hill viewpoints can also be used to assess the WoE that supports an investigated AOP, and for making a judgement on how strong the evidence is to support a particular investigated mode of action (Gross et al.,  2017 ). They have also been used to develop approaches to evaluate the confidence in a whole body of evidence when making inferences of causality (GRADE, (Grading of Recommendations, Assessment, Development, and Evaluations) and modified GRADE Approaches (Morgan et al.,  2016 )). Another influential framework on causality is the Sufficient‐Component Cause Model . This model is centred around the fact that disease causality is multifactorial, meaning that in most cases several component causes need to act together or sequentially in order to complete a sufficient disease cause (Rothman,  1976 ). Moreover, several different sufficient causes may lead to the same disease. The more component causes that are known, the more complete is the causal picture of the disease, which allows for more targeted and accurate interventions for prevention of the disease. Such component causes or risk factors are investigated in both experimental and observational epidemiological studies. In recent years, the use of causal diagrams in epidemiology or directed acyclic graphs (DAGs) has increased and these are very helpful for careful planning of a study design and analysis. While not entirely consistently used, they help both study investigators and readers of their papers with guidance on the causal relationships of outcomes and risk factors (Tennant et al.,  2021 ). DAGs thus provide the investigator with a simple and transparent way to identify and demonstrate their knowledge, theories and assumptions about the causal relationships between variables, and detailed guidance is available on how to prepare DAGs (Hernán & Robins,  2020 ; Lash et al.,  2021 ). Experimental studies, when they are feasible, for example for short‐term effects of exposure, are better suited than non‐experimental studies to determine if a certain exposure is causally related to a given outcome. When available and of good quality, these studies are generally considered the ideal design when making judgement on causality. Often the absence of an effect in such studies is considered a strong argument for 'no evidence for effect'. Such interpretations are however only valid in sufficiently powered studies where RoB is low, compliance is high, and dropout is low. These conditions can more easily be met in nutritional interventions with vitamins, minerals and other supplements where the assigned intervention requires modest commitment from the participants. However, when the assigned intervention requires substantial changes in habitual lifestyle, these conditions become more difficult to achieve. This is, for example, the case for some dietary intervention studies. Examples of such studies include interventions aimed at reducing risk of non‐communicable diseases such as cardiovascular disease (CVD) (Howard et al.,  2006 ) or individual CVD risk factors (Tang et al.,  1998 ) through assignment to complex dietary regimes (in this example low‐fat diets rich in whole grains, fruits and vegetables). In such studies, observed changes in dietary habits between intervention and controls have generally been modest and far from the goals set out for dietary changes. In such studies, compliance may decrease considerably over time, thus hampering the reliability of long‐term intervention studies. That is the idea that one can randomise and ask people to change their lifestyle habits substantially over several months or years and see if they experience lower disease frequency is subject to substantial methodological challenges that may, if not overcome, provide limited evidence for or against causality. Compared to experimental studies which involve randomised allocation to exposure, observational studies are more prone to bias, particularly confounding. To make statements on causality based on their results, replication of findings in different study populations, where confounding factors may differ, and taking other lines of evidence into consideration are usually needed to build a strong case for causality (EFSA Scientific Committee,  2017 ). One point that is sometimes made is that a case for causality can only be made from observational epidemiology by relying on prospective cohort studies. This view, however, ignores the fact that different designs often complement each other, particularly when possible sources of bias differ. As an example, when studying diseases, which have a relatively long latency period, such as cancer, cohort studies may suffer from large dropout of participants during follow‐up periods, which properly designed case–control studies can bypass. Another example is that cohort studies may not have information on potential confounders relevant for the outcome being examined (e.g. lifestyle factors), whereas case–control studies may have this information. Thus, if the case–control studies indicate that there is little or no confounding by a specific factor, or that such a confounder would have biased the study estimates towards the null, this suggests that any observed increased risks are unlikely to be due to confounding from this factor. For studying long‐term effects of exposure, there are examples where observational studies are more suitable than experimental studies and the only possible source to identify causal relationships, such as when assessing the safety of food supplements, food additives, or pesticides post‐marketing. One famous example from the area of safety assessments of pharmaceuticals is the marketing of oral contraceptives in the 1960s. A few years later (in the 1970s), observational studies started to show a consistent association between oral contraceptives and venous thromboembolism, an outcome that previous clinical trials lacked power to detect. Based on these findings, the ethinylestradiol dosage in these pills was reduced substantially, which was associated with less side effects in subsequent studies (Dhont,  2010 ). Decision on how to use evidence from an epidemiological study in a scientific assessment should be supported by a rigorous appraisal. This includes assessment of individual studies in terms of their internal validity , which is the degree to which the observed findings from a given study or experiment are unbiased and accurate for the population studied. That is, a study of appropriate design conducted and analysed to minimise RoB and chance findings has high internal validity. In the section below, key concepts on how to assess and appraise epidemiological studies are introduced. This covers both practical issues relating to understanding and interpreting exposure and outcome measures and a brief description of the main sources of biases. A more practical application of these concepts is then introduced in Section  4.3 . Frequency measures refer to discrete variables that describe distributions of outcome, exposure or covariate measures such as, disease status, mortality, occupation and smoking. Although frequency measures are generally described as proportions or percentages, two key concepts for defining categorical outcome measures in epidemiology are prevalence and incidence: Prevalence refers to the proportion of cases in a defined population at a given time . Incidence rate refers to the rate per unit of time at which new cases are occurring in a defined population . Prevalence refers to the proportion of cases in a defined population at a given time . Incidence rate refers to the rate per unit of time at which new cases are occurring in a defined population . The prevalence and incidence rate are useful measures for describing how frequently a given outcome occurs (at a certain point in time) and the rate at which it is occurring (over time). In comparing exposure groups, the common approach is the comparison of ratios as measures of effect. The most common ratio measures are explained in B ox 1 . More complete descriptions can be found elsewhere (Dohoo et al.,  2009 ; Lash et al.,  2021 ). Measures of effect are indexes that summarise the strength of the association between exposures and outcome. Effect measures can be expressed in both relative and absolute terms. The relative effect measure comparing, for example, an exposed to a non‐exposed population, can be called 'relative risk' and can be expressed as a ratio of incidence rates, ratio of prevalences, ratio of cumulative risks or can be estimated by the odds ratio. Measures of effect from prevalence in cross sectional studies or cumulative risk in cohort studies : Let us assume we have two groups (1 and 2) that differ both in exposure and occurrence of a given outcome. The probability or prevalence ( p ) of the event occurring in Groups 1 and 2 is then: p 1 = a N 1 , where a is the number of events and N 1 is the total number of subjects in Group 1. p 2 = b N 2 , where b is the number of events and N 2 is the total number of subjects in Group 2. The risk ratio of an event occurring in Group 1 compared to Group 2 is then Risk Ratio, RR = p 1 p 2 When the event incidence takes into account time at risk, the effect measure becomes the rate ratio (also for the case of Cox regression called the hazard ratio): That is, the number of new cases (events) occurring divided by the number of person‐years at risk (e.g. if 10 people are each followed for 10 years, this involves 100 person‐years of follow‐up) Then the rate ratio is defined as Rate Ratio = λ 1 λ 2 , where λ 1 and λ 2 are the rates in Groups 1 and 2, respectively. Relative effect measures are commonly used in epidemiological studies as they provide direct measure of the strength of an association between exposure and outcome. On the other hand, absolute difference measures such as the risk difference ( p 1  –  p 2 ) or the rate difference (λ 1  – λ 2 ) provide a direct measure of excess risk of outcome (or disease) between two groups. Measures of effect in case–control studies Case–control studies compare exposures and other factors in cases in the source population (over the follow‐up period) and a sample of the non‐cases. In case–control studies the incidence of the outcome cannot usually be estimated, depending on how subjects are recruited. The outcome measure in a case–control study is the odds ratio , the ratio of odds of exposure in the cases to the odds in the referents. The odds of exposure in each group are the ratio of the proportion exposed ( p ) divided by the proportion of no (1 –  p ). The odds ratio is then the odds of an event in Group 1 divided by the odds of the event in Group 2: Odds Ratio, OR = odds 1 odds 2 = p 1 1 − p 1 p 2 1 − p 2 . What this relative effect measure is estimating depends on how the controls were chosen. In most case–control studies, the odds ratio from the case–control study corresponds to the rate ratio from the corresponding cohort study. Sometimes the OR is used as an outcome measure in cross‐sectional and cohort studies. In such cases, the OR generally overestimates of the ratio of prevalence or cumulative risk between exposure and outcome. However, for rare outcomes (< 10%), the value of the OR is not too different from the Risk Ratio. Different views exist on whether measures of relative risk or absolute risk (see Box  1 ) are more appropriate for evaluating and interpreting effects or associations from epidemiological studies. However, the argument can be made that both are necessary to evaluate findings and 'one cannot be interpreted without the other' (Noordzij et al.,  2017 ). To give an example, let us say that in a well‐defined community the prevalence of perinatal mortality has increased from 0.11% to 0.44% and one suspected cause is a dramatic increase in exposure to an environmental contaminant (e.g. contamination by accidental release of wastewater contaminated with mercury into a nearby aquatic environment). In terms of measures of effect, the absolute risk difference is 0.33%, which for the individual is quite small. At the community level, such an increase in perinatal mortality would also, perhaps, not be noticed in the absence of complete registration and publication of summary statistics from relevant authorities. However, the risk ratio (RR) is as large as 4.00 (OR is 4.01). To take another example, let us say that in a RCT of a food supplement an unexpected side‐effect is revealed. At baseline, the prevalence of hypertension among study participants is 28.7% in both intervention and control groups. However, at the end of the study period, the prevalence in the intervention group was 34.5%, but 28.8% among controls (placebo). The risk difference here is 5.7%, which could be considered as relevant. The risk ratio here is only 1.20 (OR is 1.32). To conclude, absolute risk measures are the most relevant measures when assessing the population impact of exposure. However, when quantifying effect size or strength of an association, relative risk estimates are more appropriate. A thorough evaluation of any association or effect reported in a study requires careful weighing of the actual effect size, the severity of the outcome and its implications for the individual and the community/population. Ideally, sufficient information allowing translating relative outcome measures to absolute measures should be reported in any publication, but the absence of the latter should not be used to downgrade studies, at the appraisal step (see Section  4.3 ). The approach of modelling absolute risk is also used in one particular tool in risk assessment: benchmark dose (BMD) modelling (EFSA Scientific Committee,  2009 , 2017 , 2022 ). This approach was developed for toxicological studies with different groups of laboratory animals (e.g. rats or mice) exposed to several doses of a compound being tested. The absolute risk of developing disease (e.g. inflamed liver) increases from background rate at very low doses to very high or all of them at the highest doses. Based on fitting a smooth line through this data, the dose at which a fixed proportion being affected, say 5%. This can be used as a point of departure to set a protective level by taking into account the confidence interval of the estimate and adding safety factors. This methodology is now sometimes being adopted and applied to epidemiological data (WHO,  2010 ), with, for example, the absolute effect (e.g. IQ) related to the exposure level (e.g. lead in blood), and the BMD estimated for a fixed effect, in this case a shift of one IQ point (EFSA CONTAM Panel,  2010 ). BMD modelling is useful as a tool for establishing HBGVs but is not in itself a tool for assessing causality, which needs to be done based on integrating the strands of evidence from multiple studies and sources. Further reflections on this approach are presented in Section  4.4.3 . In controlled experimental animal studies, the investigator usually has control over the exposure conditions and their changes for the whole duration of the experiment. In such cases, major exposure misclassifications are largely confined to lack of compliance by study participants or other deviation from intended treatment. In humans, similar control over exposure conditions may be achieved in highly controlled metabolic trials that can, for ethical and practical reasons, usually be conducted over a limited time period. For other experimental studies, including many RCTs, the investigator has less control, as exposure is only assigned and not always adequately monitored. As an example, in an RCT testing the effect of long‐chain omega 3 fatty acids supplementation on blood pressure, the effect estimate, in strict terms, measures the average effect of administering the supplementation. That is, the average effect over those taking the supplement and those who did not (or did something else). Therefore, compliance with the treatment allocation should be carefully ensured and monitored, whenever possible, throughout the experiment. Exposure misclassification due to departures from the allocated treatment tends to distort the measured effects towards null, with some exceptions (Yland et al.,  2022 ). In observational epidemiological studies, the investigator does not control the exposure conditions. Therefore, the assessment of exposure must rely on laboratory measurements or other proxies of the exposure itself, such as questionnaires, historical records, geographical information systems, environmental modelling techniques and other tools. In such settings, the key challenge is not only to assess exposure in a reliable way, but also to do that in the appropriate time window, assuming that exposure duration and amount were consistent with a causal effect, and biologically plausible. What can be considered as 'acceptable' or 'valid' in exposure assessment depends, however, markedly on the exposure and outcome under study. For example, a single blood measure of a persistent substance such as dioxins that has an elimination half‐life of several years could be considered a reliable marker of long‐term exposure and of relevance for most long‐term health outcomes, including chronic disease such as cancer or CVD. The same would not apply for a non‐persistent compound such as caffeine, which has an elimination half‐life of a few hours and whose body levels may markedly change over time. For caffeine, therefore, one or more objective measurements from blood samples would be enough to examine short‐term effects on blood pressure, but repeated measurements in blood stretching over longer time period would be needed to reliably assess possible effects on disease such as stroke and other CVD. Despite blood measurements of a compound being an objective measure, substantial long‐term exposure misclassification for single measurements may occur due to individual variation in uptake and excretion. In a questionnaire, a simple question on behaviour, including habitual coffee, alcohol intake or smoking, can often be considered reasonably accurate measures of exposure, as such habits can be assumed (or have been shown) to stay rather constant over time for most individuals. However, self‐reported exposures are often considered inferior to objective methods as, for example heavy smokers (or drinkers) are more likely to selectively underreport their habits. Objective methods are generally preferred, but when such methods do not exist or are not used, a RoB should not automatically be assumed. As an example, for smoking the use of urinary cotinine measurement as an objective biomarker can be useful to quantify exposure misclassifications, compared to relying on self‐reported estimates only. Exposure misclassification in epidemiological research may, however, also occur when 'objective' methods for assessing exposure are used. For example, providing subjects with a fitness watch to objectively measure physical activity may result in an activity higher than usual, simply because study participants have become motivated to use the instrument. It is also well known that use of dietary records can result in changes in dietary habits during the period of recording as some foods are more difficult to weigh and record than others. In addition, such records cannot generally assess rare or highly seasonal food consumption in a reliable way. Another example is use of 24‐h urine sampling, which allows for accurate assessment of exposure to several substances over the past day. However, the burden of collecting all urine excreted during that period may lead to subjects becoming less mobile (or behaving differently), resulting in changes in exposures that would not normally occur, or may decrease the number and completeness of participant recruitment due to lack of participation. Therefore, the simple act of trying to capture exposure with high precision may lead to biased estimates due to behavioural changes. In addition, even when an ideal biomarker of exposure, such as the determination of a substance in one or preferably multiple 24‐h urine samples, is not available, determining such a substance in a less adequate matrix, such as in one or more random urine or morning samples, may still provide a useful estimate. By considering the strength and the limitations of the methods applied for exposure assessment, a more appropriate use of the available evidence can be made in the risk assessment process. Based on the discussion above, a brief summary of strengths and weaknesses of different exposures measures commonly used in human studies is outlined in Table  1 . Overview of the major strengths and limitations of different exposure measures frequently used in epidemiological studies. For example, clinical or other public health records containing information on past exposures (such as smoking or use of supplements or medication) or monitoring devices (such as fitness watches, air pollution monitors). Also includes occupational records (on past exposure). In randomised controlled trials and other experimental studies. One common practice when examining continuous exposures in observational studies is to divide the exposure variables into categories, using a priori or data‐driven (percentiles) cut points of exposure. The dose–response is then examined relative to one reference exposure category. One reason why this approach has historically been used is that the resulting effect estimates from quantile analyses provide simple representation of the underlying dose–response relationship that is easy to interpret in comparison to, for example, effect estimates obtained from non‐linear regression. The use of quantiles does, however, lead to some loss of precision and other adverse consequences (Rothman,  2014 ) and the pros and cons of this approach are discussed in some detail in Appendix  G . Effect measures, as estimated in epidemiological studies, represent an estimate of the underlying true parameter in the reference population. To make inferences about such parameters, uncertainties around the statistical (or central) estimate need to be considered. This is done by estimating the confidence interval which accounts for random errors 15 in the exposure and outcome. In general, the larger the sample size, the higher the precision, which is reflected by narrower confidence interval around the central estimate. In terms of reporting effect measures, both the central estimate and its confidence interval should be reported. The p ‐value may provide useful supplementary information, but there is growing consensus that significance testing involving arbitrary cut‐points (e.g. p  < 0.05) may not be appropriate (Amrhein et al.,  2019 ; EFSA Scientific Committee,  2011 ; Greenland et al.,  2016 ; Wasserstein & Lazar,  2016 ). For further discussion on this issue, the reader is directed to Appendix  B Hypothesis testing vs. estimation. Similarly, as for the effect measure from a single study, effect measures from several studies (or experiments) should, in the absence of systematic bias, follow a distribution affected only by random (study‐specific) errors that are symmetric around the true estimate. It is, however, well known that publication bias can occur when the probability of publication of study results is correlated with the reported effect size (or statistical significance), i.e. when small effect sizes (or non‐significant results) are systematically underrepresented in the available (published) body of evidence. As a result of publication bias, the body of available evidence may bias the summary of evidence away from the null in cases where there is truly no effect or skew the estimate from its actual value when an effect truly exists. A similar bias would result from a selection process of published studies for evidence integration. Thus, both the selection of results for publication and the selection of published studies in evidence integration should be independent of reported effect sizes. Mandatory pre‐registration of clinical trials can mitigate publication bias. 16 A worldwide voluntary pre‐registration of studies involving animals has been launched recently (Bert et al.,  2019 ). A pre‐registration and/or a publication of the protocol of observational epidemiological studies, as well as of systematic reviews and meta‐analyses, can be assumed to have similar positive effects. Systematic errors differ from random errors in as far as the former would be present even in an infinitely large study, whereas random errors can be reduced by increasing the study size. Thus, systematic errors (or 'bias') occur if a systematic difference between the true value and the measured value exists (Pearce,  2005 ). Systematic errors are usually classified into three types of bias: information bias, confounding and selection bias. Information bias concerns misclassification of the study participants with respect to exposure, outcome or confounder status. Usually, two types of misclassifications are considered: non‐differential and differential misclassification. Non‐differential misclassification occurs when the probability of misclassification of exposure or health outcome is the same for cases and non‐cases, i.e. exposed and non‐exposed persons are equally likely to be misclassified according to disease outcome; or diseased and non‐diseased persons are equally likely to be misclassified according to exposure. With some exceptions (Yland et al.,  2022 ), non‐differential misclassification of exposure biases the effect estimate towards the null and tends to reduce the size of the effect which is of particular concern in studies which find weak associations (Pearce,  2005 ). Differential misclassification occurs when the probability of misclassification of exposure is different in cases and non‐cases, or the probability of misclassification of disease is different in exposed and non‐exposed persons. This can bias the observed effect estimate either towards or away from the null value (Pearce et al.,  2007 ). For example, in a case–control study of lung cancer, the recall of past exposures, e.g. smoking, might differ in cases from that of the controls, leading to differential misclassification. This could bias the odds ratio towards or away from the null (value of 1.0). While several detailed definitions of confounding exist (e.g. Lash et al.,  2021 ), in this document, a confounder is referred to as a variable (or factor) that is associated with both the exposure and outcome, resulting in a spurious association between the two. Confounding is to be expected if the factor of interest is associated with a different factor (the 'confounder') which is a known or unknown risk factor for the outcome of interest. For example, assume that the exposure to substance X (risk factor of interest) is associated with co‐exposure to cigarette smoke (confounding factor), i.e. individuals who are exposed to higher concentrations of substance X also smoke more cigarettes compared to the unexposed; and smoking is also causally related to the outcome of interest. When confounding is not considered, the potential effect of the risk factor of interest may be mixed with the effect of the confounder or even entirely explained by that confounder. Consequently, the statistical effect estimate is biased with unknown magnitude and direction. It is a matter of subject expertise to identify potential confounders, to plan collection of confounder information at the design stage, to adjust for confounders in the analysis and to consider the possibility of residual confounding 17 in the interpretation of the study results. DAGs are an increasingly popular approach for identifying confounding variables that require conditioning when estimating causal effects (Tennant et al.,  2021 ). Confounding can be mitigated by the design of the study and through 'adjustment' in the statistical data analysis. An ideal study design to control confounding ensures that the expected variation of all potential confounders is identical across all levels of the main risk factor. RCT are epidemiological studies in which this is theoretically possible since allocation of intervention or treatment (main study factor) to the participants is at random. Thus, potential confounders should (on average) be evenly distributed in all treatment groups. Confounding can still occur in an RCT due to imbalanced distribution of confounding factors across treatment groups. In observational studies, where participants are not randomised, confounding is more likely to occur. If potential confounders have been identified in the design of the study, and the respective information on all confounders is collected at the individual level, it is possible to statistically adjust for confounding. Several approaches for confounder control exist (Kestenbaum,  2019 ). One approach for this involves stratification by the confounding factor and construction of a weighted effect estimate (e.g. the Mantel–Haenszel odds ratio estimate). Multivariable models provide a similar adjustment (correction of confounding bias) and offer the additional flexibility to accommodate categorical as well as continuous risk factors. The fact that a risk model is adjusted for one or several confounding factors does not give a full guarantee against confounding bias. It requires a case‐by‐case expert judgement from a subject matter and statistical modelling viewpoint to decide whether potential confounding is adequately addressed. While the technique of matching can be used to prevent confounding (from the matching factor) from occurring in cohort studies, in case–control studies, it may also lead to the opposite result, i.e. introduce a selection bias that behaves like confounding, because it may violate the principle of selecting controls at random from the source population. In practice, matching may artificially bring the exposure distributions in the cases and controls closer together than they really are in the source population (overmatching) and therefore introduce bias. Therefore, in matched case–control studies, the matching factor will in most cases need to be controlled for in the analysis (Pearce,  2016 ). Other methods of controlling for confounding such as weighting and propensity scores can also be applied (Lash et al.,  2021 ). In observational epidemiological studies, usually more than one factor will differ between the compared groups, in which case they could all be potential confounders. For this reason, the results of such studies are always subjected to multiple regression analysis, which allows for the adjustment of the effect estimates for several factors simultaneously in the same statistical model. That means that the effect estimates obtained from such modelling are unconfounded by the effects of the other factors that are included in the same model (provided that these other factors have been defined appropriately and measured accurately). Residual confounding may still exist for several reasons, including (1) other confounders that have not been included in the model, (2) imprecise measurement of one or more confounders controlled for, or (3) inappropriate modelling of the confounder in the statistical analyses. Even though it is very important to evaluate the appropriateness of the statistical model used, the validity of the respective assumptions, and the model building strategy, etc., this is a very technical issue which is beyond the scope of this document. It is advised that for this task the assistance of a statistician or an epidemiologist be requested. Selection bias is an important systematic error in observational studies. It involves bias arising from how the study participants are selected (or select themselves) from the source population. It thus arises when the relation between exposure and disease in the study population (i.e. the actual study participants) differs from the relation in the source population from which study participants are drawn (Lash et al.,  2021 ). In general, selection bias occurs as a result of the procedures used to select study participants (Pearce,  2005 ). Because usually only information from the recruited study population is known, selection bias must typically be evaluated indirectly or theoretically, and anticipated in the study design. It may be possible to ‘correct’ selection bias in a study, if the factors influencing selection can be controlled for in the analysis (in the same way that confounders can be) (Pearce,  2005 ). This requires, however, that additional information (on these factors) needs to be available for all study participants. Selection bias could occur, for example, when people enrolled in a cohort study are self‐referred. One such example would be if people self‐referred to a study, knowing that they had the studied exposure and suspecting that they may also have the outcome (maybe experiencing relevant symptoms). Selection bias would occur if these people would indeed have a higher probability of the outcome compared to exposed people in general. Selection bias can be related not only to ‘selection’ to enter a study, but also to a ‘selection’ to exit a study. In this sense, the bias resulting from a loss to follow up (persons lost to the study investigators before the end of the study) that is differential between the two compared groups (e.g. exposed and non‐exposed) is also a form of selection bias (Hernán et al.,  2004 ). Selection bias can also result from using an inappropriate control group in a case–control study. In these studies, the purpose of the control group is to provide an estimate of the distribution of exposure in the source population from which the cases originate. A control group may fail to provide this information, when, for example, the population from which the cases originate is not appropriately defined, or selection of controls is based on convenience rather than on specific criteria that need to be fulfilled. For a detailed discussion on selection of controls in case–control studies, the reader is referred to Wacholder, McLaughlin, et al. ( 1992 ); Wacholder, Silverman, et al. ( 1992b , 1992c ). Confounding generally involves biases that can occur even if everyone in the source population took part in the study as they are inherent in the source population. Selection bias, on the other hand, covers biases that stem from the procedures that are used to select the study participants from the source population. As a result, selection bias is not an issue in a cohort study with complete follow‐up, as the study cohort composes the entire source population. Selection bias can, however, occur if participation in the study or follow‐up is incomplete or if the response rate depends on the exposure and outcome (e.g. overrepresentation of heavily exposed persons who are more likely to be diagnosed with disease (Pearce,  2005 )). Similarly, selection bias is not an issue if a case–control study involves all cases in the source population (and risk period) and the controls are a random sample of the source population, and the response rate is 100%. However, selection bias may occur if response varies by exposure and disease status. A key issue of epidemiological research is to identify and assess the extent to which the effect of an exposure may depend on the level of one or more other factors and whether or not such factors may have an independent causal effect on the endpoint under consideration. Such factors are described as effect modifiers, the underlying concept being the existence of interactions between two or more factors. For example, if an exposure vs no exposure has a rate ratio for the outcome of 3.0 in men and 1.5 in women, sex may be an effect modifier because the effect of the exposure seems different in men and women. Identification of effect modifiers has a key relevance in both scientific research and risk assessment, and also plays a crucial role when assessing the external validity of study findings. Therefore, assessment of interactions has become a key goal of scientific research, in order to identify higher susceptibilities to adverse or beneficial effects of a given exposure, due to other exposures or endogenous factors (such as children, pregnant women, diseased persons, individuals with specific dietary/life‐style habits or genetic backgrounds). Effect modification is entirely different from confounding, since the former concerns the ability of one factor to modify the causal effect of another factor on a defined endpoint. Effect modification occurs when an exposure has a different effect among different subgroups; hence, it is associated with the outcome but not the exposure. Therefore, understanding of effect modification is necessary to characterise causal association, interactions and susceptibilities, which is important in risk assessment. Confounding. on the other hand. must be minimised when planning a study or controlled for at the analysis stage. Effect modification may be assessed either as statistical interaction or biological interaction (Lash et al.,  2021 ). Statistical interaction is just a departure from the basic form of a statistical model and is therefore dependent on the metrics used in the statistical model, e.g. multiplicative vs additive models. Biological interaction describes the mechanistic interaction between causal factors, assessing the departure over additive effects of the combination of single risk determinants. It amounts to an attempt to identify susceptibility factors, which are factors that modify the effect of an exposure on a specific health outcome. Unlike statistical interaction, it is a biological phenomenon. The assessment of biological interaction requires considerably more data than the assessment of the effect of a single factor and may involve the net effect of factors that are causes and preventives in varying combinations (see Section  4.1.5.1 ). Generally, epidemiological studies conducted in the target species have clear advantages in terms of relevance for risk assessments over studies conducted in non‐target species, as uncertainties due to between‐species extrapolation are eliminated. It can often be assumed that the exposure conditions in observational settings, if appropriately captured, are more similar to real conditions in terms of duration, concentration of exposure and other circumstances than in experimental studies where the exposure conditions are chosen by the investigator. A refined assessment of the relevance of the evidence from epidemiological studies for risk assessments requires that the choice and characteristics of the study population, the selection of study participants, the exposure conditions as well as the case definition and the measurement of the outcome be evaluated with respect to the specific research question. When assessing external validity several different concepts should be distinguished: There is a target population to which we wish to draw inferences (e.g. all people in the EU, all people on the planet) There is a source population which is used as the source of participants for a particular study (e.g. everyone living in Parma, British doctors) There is a study population , i.e. the group of people who actually take part in the study, with some of the source population not taking part either due to selection by the investigators, or self‐selection (i.e. non‐response) There is a target population to which we wish to draw inferences (e.g. all people in the EU, all people on the planet) There is a source population which is used as the source of participants for a particular study (e.g. everyone living in Parma, British doctors) There is a study population , i.e. the group of people who actually take part in the study, with some of the source population not taking part either due to selection by the investigators, or self‐selection (i.e. non‐response) External validity refers to whether the study findings can be generalised to the target population Provided that disease outcome has not affected the choice of source population, and if 100% of the source population is included in the study, there can be no selection bias. Rather, any differences between the results in the source population, and what would have been obtained from studying the whole target population is a result of confounding (different confounding structures) and/or effect modification (see Section  4.2.1.5 ). In most studies, the ‘target population’ is left undefined, with the implication that the findings are intended to apply to the general population. In fact, there is no need to invoke some hypothetical target population to validly design and analyse a study. In more simple terms, generalisability is a matter of expert judgement, not statistical considerations (Lash et al.,  2021 ). When studying exposure–health relationships in the target population, lack of representativeness may not be a problem when identifying risk factors (e.g. the original findings for smoking and lung cancer included a study in British doctors). External validity is of particular relevance for descriptive studies and cross‐sectional surveys aimed at determining disease frequency or other characteristics in a given population. If the study population recruited in these studies is representative of the source population, statistical inferences may be made about the characteristics (parameters) in the source population, based on the information from the study population. Representative samples can be obtained using specific probability sampling techniques. However, sometimes, it is not possible to obtain representative samples from a population. In those cases, it is very important to consider if the population has been sampled selectively, based on specific factors, and which ‘sub‐population’ the sample may represent. For example, testing cattle at the slaughterhouse for bovine paratuberculosis, a chronic progressive infectious disease of cattle, may not provide a representative picture of the disease in the entire bovine population of the area served by the slaughterhouse. Animals at the slaughterhouse may have more advanced infections than in the ‘general’ populations of cattle in the area, because they might have been sent to the slaughterhouse due to their infection or due to old age. Or conversely, animals with very advanced cases might have already died or euthanised at the farm and never made it to the slaughterhouse (Nielsen et al.,  2011 ). External validity and internal validity can be interlinked following various patterns (Steckler & McLeroy,  2007 ). For example, controlled experimental studies have, for reasons explained above, generally lower RoB than observational studies. This higher internal validity often comes at a price. For example, adverse effects of chemicals can only be studied under experimental conditions in animals, which then have to be extrapolated to humans where such experiments are not feasible. Conversely, when human data exist, addressing uncertainty in terms of external validity means relying on human observational studies, which generally have lower internal validity, compared to experimental studies in humans conducted in highly selected populations. Moreover, external validity is affected by the population characteristics of the studies comprising the best available evidence. As an example, RCTs testing the effect of pharmaceuticals or individual nutrients on health are usually conducted in specific populations that are most likely to benefit from treatment (pharmaceuticals) or demonstrate some beneficial effects (nutritional RCTs). Such a selection may, however, hamper extrapolations to the more general population. For example, results from a RCT showing modest increase in cancer risk as a result of beta carotene supplementation in male smokers (The Alpha‐Tocopherol Beta Carotene Cancer Prevention Study Group,  1994 ) may provide a reasonable argument for not taking beta carotene as a food supplement for cancer prevention. On the other hand, it could be argued that these results would perhaps not be the same if conducted in healthy non‐smokers who have a much lower cancer risk. In terms of extrapolating such exposures to more real‐life setting, such increase in cancer risk due to use of beta carotene supplements is not comparable to exposure to beta carotene from the habitual diet. Similarly, findings of increased mortality in postmenopausal women with underlying CVD following supplementation with vitamins C and E (Waters et al.,  2002 ) could be considered to have modest to low external validity for the general population. In terms of making conclusions on causality, both internal and external validity need to be considered. In Section  4.1 , a brief description of different experimental and non‐experimental studies was given, highlighting their main strengths and limitations. In this section, different types of biases that may occur in each of these designs have been explained. When interpreting the findings of epidemiological studies and assessing the evidence generated by them, there is sometimes a preference towards ranking studies in terms of internal validity by their design (design hierarchy). This usually translates to emphasising the role of experimental studies (RCTs) and, among the non‐experimental ones, that of cohort studies. However, such ranking is often not justified as the examples and discussions above have tried to highlight. In fact, all study designs are more (or less) prone to biases. In the absence of an empirical basis for the relative importance of biases in a given research area, it can be misleading to infer bias proneness from study design only. Some typical biases that may occur in different study designs are briefly summarised in Table  2 . Study designs and typical biases. General note on Table  2 : a lack of external validity may be an issue with all study designs, e.g. patients included in a RCT may have severe disease and the findings may not be generalisable to mild disease, a study conducted in men may not be generalisable to women, in adults to children, etc. The letters are explained in the text below. Selection bias Selection bias at baseline is not usually a concern in RCTs, provided that the study is sufficiently large, if allocation is adequately randomised and concealed after the study participants are selected; selection bias may occur due to loss to follow‐up if this differs by treatment group or outcome (Hernán et al.,  2004 ). Selection bias in cohort, case–control and cross‐sectional studies may result from the way in which the study participants are selected (or select themselves) from the source population, leading to them being unrepresentative of the source population in terms of exposure or outcome. Selection bias can also occur due to loss to follow‐up. Selection bias is by definition rare in ecological studies provided that they cover an entire defined population. Selection bias at baseline is not usually a concern in RCTs, provided that the study is sufficiently large, if allocation is adequately randomised and concealed after the study participants are selected; selection bias may occur due to loss to follow‐up if this differs by treatment group or outcome (Hernán et al.,  2004 ). Selection bias in cohort, case–control and cross‐sectional studies may result from the way in which the study participants are selected (or select themselves) from the source population, leading to them being unrepresentative of the source population in terms of exposure or outcome. Selection bias can also occur due to loss to follow‐up. Selection bias is by definition rare in ecological studies provided that they cover an entire defined population. Confounding d Confounding may occur in RCTs if allocation concealment and/or randomisation is not adequate, for example when the study group is small, thus making treatment groups not directly comparable at baseline. e Confounding can occur in all observational designs. Confounding may occur in RCTs if allocation concealment and/or randomisation is not adequate, for example when the study group is small, thus making treatment groups not directly comparable at baseline. Confounding can occur in all observational designs. Information bias f Information bias on exposure (i.e. exposure misclassification) occurs if the treatment groups of RCTs are not maintained (i.e. participants stop or switch treatment); information bias on the outcome occurs if participants receiving the treatment may be subjected to more or less intensive diagnostics compared to the comparison group (lack of blinding, diagnostic bias). g Information bias may occur in cohort studies due to misclassification of exposure or the outcome, e.g. if exposed participants may receive more intensive diagnostics compared to non‐exposed (lack of blinding, diagnostic bias). With some exceptions, non‐differential (random) misclassification of exposure or disease will usually produce a bias towards the null (no effect) and cannot explain positive findings. Differential information bias (e.g. if classification of the outcome differs by exposure status) can produce bias in either direction. h Information bias may occur in case–control and cross‐sectional studies due to misclassification of exposure or the outcome, particularly when the classification of exposure is based on participant recall (recall bias). Non‐differential (random) misclassification of exposure or disease will usually produce a bias towards the null (no effect) – with some exceptions – and cannot explain by itself positive findings. Differential information bias (e.g. if recall is different in cases and controls) can produce bias in either direction. Exposure assessment may be affected by the disease condition (e.g. due to reverse causation). i Information bias is a major concern in ecological studies, since exposure and outcome information is only available on a population and not on the individual level – thus, even if there is an association between exposure and outcome at the population level, it may not be the case that the outcome was more common in the exposed individuals – i.e. there may be an association at the population level but not at the individual level, and vice versa. The assumption that the observed associations can be transferred from the population to the individual level is known as ecological fallacy (Hammer et al.,  2009 ). Information bias on exposure (i.e. exposure misclassification) occurs if the treatment groups of RCTs are not maintained (i.e. participants stop or switch treatment); information bias on the outcome occurs if participants receiving the treatment may be subjected to more or less intensive diagnostics compared to the comparison group (lack of blinding, diagnostic bias). Information bias may occur in cohort studies due to misclassification of exposure or the outcome, e.g. if exposed participants may receive more intensive diagnostics compared to non‐exposed (lack of blinding, diagnostic bias). With some exceptions, non‐differential (random) misclassification of exposure or disease will usually produce a bias towards the null (no effect) and cannot explain positive findings. Differential information bias (e.g. if classification of the outcome differs by exposure status) can produce bias in either direction. Information bias may occur in case–control and cross‐sectional studies due to misclassification of exposure or the outcome, particularly when the classification of exposure is based on participant recall (recall bias). Non‐differential (random) misclassification of exposure or disease will usually produce a bias towards the null (no effect) – with some exceptions – and cannot explain by itself positive findings. Differential information bias (e.g. if recall is different in cases and controls) can produce bias in either direction. Exposure assessment may be affected by the disease condition (e.g. due to reverse causation). Information bias is a major concern in ecological studies, since exposure and outcome information is only available on a population and not on the individual level – thus, even if there is an association between exposure and outcome at the population level, it may not be the case that the outcome was more common in the exposed individuals – i.e. there may be an association at the population level but not at the individual level, and vice versa. The assumption that the observed associations can be transferred from the population to the individual level is known as ecological fallacy (Hammer et al.,  2009 ). This chapter focusses on tools and processes for assessing characteristics of individual studies to enable their quality to be assessed in a thorough and consistent way. Risk assessments undertaken by EFSA are an integral part of health‐related regulatory decision making, a field characterised by large diversity in context, content, methods, information sources and implementation (Diefenbach et al.,  2016 ). A common feature in any decision‐making process is the efficient retrieval, organisation and integration of the available evidence on a specific question or term of reference (Langlois et al.,  2018 ). For EFSA's risk assessments evidence should be retrieved from many and diverse sources. The information derived from each piece of evidence, however, is not necessarily equally relevant as different study design are prone to different sources of bias as described in the previous section. Taking into consideration the above, for a successful integration to be achieved, the selected evidence base must be organised in a way that assigns an appropriate role to each information piece. The process of assigning such a role to each piece or body of evidence is complex and specific to each risk assessment question and context, and such decisions cannot be taken on the basis of study design only. Further guidance on this process can be found in Section  4.4 and in EFSA guidance documents (EFSA,  2015 ; EFSA Scientific Committee,  2017 , 2023a ). Individual study appraisal is organised as follows (Agency for Healthcare Research and Quality,  2002 ): identification of the key elements of the research/assessment question under study (EFSA Scientific Committee,  2023a ) assessment of internal validity (RoB) summarisation of the study appraisal results. identification of the key elements of the research/assessment question under study (EFSA Scientific Committee,  2023a ) assessment of internal validity (RoB) summarisation of the study appraisal results. Clarifying the key elements of a research/assessment question is the starting point of the study appraisal process. A clearly framed question 'creates the structure and delineates the approach to defining research objectives, conducting systematic reviews and developing health guidance' (Morgan et al.,  2018 ). A formal strategy for identifying key elements is essential for: designing the literature search strategy, identifying the studies that by design and conduct best fit the risk assessment needs, clarifying important population characteristics and subgroups, expanding or narrowing the exposure spectrum and defining the different exposure strata, choosing the comparison that best fits the terms of reference among the usually large number of performed comparisons (i.e. combinations of exposure category and the various endpoints), organising and prioritising the relevant endpoints and follow‐up timepoints thereof . designing the literature search strategy, identifying the studies that by design and conduct best fit the risk assessment needs, clarifying important population characteristics and subgroups, expanding or narrowing the exposure spectrum and defining the different exposure strata, choosing the comparison that best fits the terms of reference among the usually large number of performed comparisons (i.e. combinations of exposure category and the various endpoints), organising and prioritising the relevant endpoints and follow‐up timepoints thereof . Before the publication of EFSA's GD on systematic review (EFSA,  2010 ), the PICO/PECO/PO/PIT 18 approach has been the framework most widely adopted in EFSA for defining the key elements of a question and thus structuring the problem formulation process. More recently, a new approach named APRIO 19 has been defined by EFSA's Scientific Committee in its guidance on protocol development (EFSA Scientific Committee,  2023a ). Owing to its cross‐cutting nature, the APRIO paradigm is broadly applicable within and across the various domains of EFSA and is considered to overcome the limited applicability of the existing frameworks in some of EFSA's domains. Therefore, although the PICO/PECO/PO/PIT represents a valid approach in some domains, the APRIO is currently EFSA's recommended framework for problem formulation and is considered preferable, to enhance harmonisation across domains. After clarifying the key elements of a research/assessment question the next step is to assess internal validity of different studies . Internal validity is the extent to which a piece of evidence provides an unbiased estimate of the association between exposure and outcome, i.e. the extent to which the study results reflect the ‘truth’ among the study population. For a given study, assessment of internal validity refers to evaluation of its design and conduct, including reflections on the likelihood, degree and direction of possible biases. Such an assessment can be facilitated by organising the appraisal into various bias domains. Selection bias, information bias and confounding are key domains to be included and can be operationalised to specifically address e.g. classification of exposures, departures from intended exposures, missing data, outcome ascertainment . Critically summarising the appraisal results of a study essentially pertains to the magnitude of the effects and the precision of the point estimates. However, while clarifying what are the main results of the study, various parameters are of considerable importance, such as the proportion of exposed and unexposed, the effect metric used and its appropriateness, the magnitude of the effect in absolute and relative association measures; the reporting of both crude and adjusted effect estimates; the confounders adjusted for; and the implementation of subgroup analysis. The next sub‐section provides a brief description on the development of appraisal and RoB tools and an overview of existing tools. Guidance on the use of RoB tools for assessing internal validity of individual studies and on summarising their results in a systematic manner (points b) and (c) above, respectively is given in Sections  4.4.1.5.1 and 4.4.1.5.2 . In view of the challenges inherent in study appraisal, the practical need of a standardised process has led to the development of various appraisal instruments. Many reviews, inventories and annotated bibliographies of critical appraisal tools applicable to different study designs have been produced with different aims. Some of these exercises have been performed by groups of researchers, such as those developed by the Cochrane Collaboration. 20 Others have been the result of the efforts by risk assessment organisations or governmental bodies which were interested in implementing structured and harmonised approaches in their own assessment processes ( BfR , 21 IARC , 22 ECETOC , 23 NIHS R&D HTA 24 Programme, AHRQ , 25 NTP‐OHAT , 26 EPA‐IRIS , 27 Navigation Guide , USDA‐NESR 28 ). Currently, there are no agreed gold standards and no standardised processes for developing such tools (see Appendix  D ). Critical appraisal tools have been developed for different purposes and contexts such as (1) to appraise single studies; (2) to assess RoB in systematic reviews; and (3) to inform the weighing of the evidence in risk assessments. They cover one or more study designs and can have one of the following structures (Sanderson et al.,  2007 ): summary checklist consisting of only a list of items (e.g. CASP ), a checklist accompanied by a summary qualitative judgement (e.g. EPIQ ), a scale with the list of items and scores attached, which result in a summary numerical score (e.g. Jadad ; Newcastle‐Ottawa ), domain‐based tools (e.g. Cochrane RoB 2.0 ; NTP‐OHAT). summary checklist consisting of only a list of items (e.g. CASP ), a checklist accompanied by a summary qualitative judgement (e.g. EPIQ ), a scale with the list of items and scores attached, which result in a summary numerical score (e.g. Jadad ; Newcastle‐Ottawa ), domain‐based tools (e.g. Cochrane RoB 2.0 ; NTP‐OHAT). EFSA has used the NTP‐OHAT tool in several of its scientific assessments since 2015. The Office of Health Assessment and Translation (OHAT) from the National Toxicological Program (NTP) in the US has outlined operating procedures for systematic review and evidence integration for conducting literature‐based evaluations in environmental health and toxicology (Rooney et al.,  2014 ). They have developed a RoB Tool that applies a parallel approach to the evaluation of RoB for human and animal studies, facilitating consideration of potential bias across evidence streams with common terminology and domains (National Toxicology Program,  2019 ). This approach was developed drawing on several different sources including the most recent guidance from the Agency for Healthcare Research and Quality (Viswanathan et al.,  2012 ), the Cochrane RoB tool for non‐randomised studies of interventions (Sterne et al.,  2016 ), Cochrane Handbook (Higgins & Green,  2011 ), SYRCLE's RoB tool for animal studies (Hooijmans et al.,  2014 ) and the Navigation Guide (Woodruff & Sutton,  2014 ). The NTP‐OHAT RoB tool is designed to evaluate, through different sets of questions, the internal validity of several of the most common study designs encountered in chemical and nutrient risk assessment. These questions are complemented by detailed criteria ('practices') that define aspects of the study design, conduct, and reporting required to reach each RoB rating. It can be applied to many research questions and tailored to the scope of the assessment. In Appendix  D Relevant inventories and reviews of critical appraisal tools, a table showing a selection of inventories and reviews on critical appraisal tools is provided. This includes a description of their context, objectives and study designs covered. An overview of RoB tools for appraising systematic reviews (i.e. research synthesis) and for appraising individual primary research studies is provided for all types of EFSA assessments in Appendix  E Overview of appraisal tools. Assembling and assessing the evidence from multiple epidemiological studies within a risk assessment framework serves three needs; first, the hazard assessment, i.e. contributing towards the assessment of causality in an association; second, characterising the exposure–response relationship; third, assessing the uncertainty underlying the two previous endeavours via characterising possible biases and consistency, or the lack thereof, across the appraised evidence. The different steps of evidence assessment and evidence integration generally follow a systematic approach, and they are implemented first within a single evidence stream of either human or other studies (e.g. studies using laboratory animals or in vitro studies) (see Figure  1 ). Steps of evidence assessment and evidence integration (HOC: health outcome category, i.e. a combination of similar/biologically related health outcomes into one group). The process of gathering, assessing and integrating the pertinent epidemiological evidence is preceded by a customisation step that renders the process fit‐for‐purpose (see Section  4.4.1.2 ). During this step, evidence mapping is crucial to clarify the boundaries of a scientific assessment during both the mandate‐ and the assessment planning phase (Peters et al.,  2020 ). Towards that end, scoping reviews and evidence maps 29 can be used and the right timing for this process is during the planning phase when developing the assessment protocol (EFSA Scientific Committee,  2023a ). Scoping reviews are broad literature compendia that comprehensively examine the extent, range and nature of the relevant research activity, and identify gaps in an existing body of literature. Thus, they can inform a variety of decision‐making settings, including but not limited to risk assessment and policy making, and can provide direction for future research priorities. The use of evidence mapping and scoping reviews is particularly helpful in scientific assessments where the underlying evidence is characterised by great volume and considerable heterogeneity (both in terms of study design and endpoints under study). As the planning phase is an iterative process, the information from scoping reviews and evidence maps will help to specify the most appropriate methods for evidence synthesis by providing predictions on the amount and heterogeneity of studies that need to be assessed. The use of scoping reviews and evidence maps may also guard against ad hoc changes to the protocol at later stages, which are often made when it becomes clear that the approach originally chosen is not compatible with the available evidence, resources or time constraints. KEY POINTS Consider doing a scoping review/evidence map, especially in data‐rich and/or heterogeneous topics. Frame the APRIO (Agent, Pathway, Receptor, Intervention and Output) based on the Terms of Reference. Consider doing a scoping review/evidence map, especially in data‐rich and/or heterogeneous topics. Frame the APRIO (Agent, Pathway, Receptor, Intervention and Output) based on the Terms of Reference. Consider doing a scoping review/evidence map, especially in data‐rich and/or heterogeneous topics. Frame the APRIO (Agent, Pathway, Receptor, Intervention and Output) based on the Terms of Reference. While mapping the evidence related to a Scientific Assessment (SA) update, the previous SA could serve as a starting point and function as a basis for predictions on how much evidence may have accumulated since the last assessment. There are numerous examples where the difference between a specific SA or Opinion and its update is substantial. One such example is the update of EFSA's scientific opinion on polybrominated diphenyl ethers (PBDEs) in food (EFSA CONTAM Panel,  2011 ) in 2024 (EFSA CONTAM Panel,  2024 ), where more than 200 new human epidemiological studies had been published since the previous assessment and had to be assessed. The performed evidence mapping helped make important planning decisions related to the organisation and prioritisation of health outcome categories, the extent of multi‐congener exposure assessment, and the feasibility of quantitative evidence synthesis of the human data (and the required allocation of resources). The study appraisal process should be tailored to serve the specific purpose of the risk assessment, and this should be done at the stage of protocol development. Following the structure proposed before, the first level of customisation is implemented at the level of the APRIO elements of the research question (EFSA Scientific Committee,  2023a ). Having framed the research question and related sub‐questions that are relevant to the risk assessment, a decision needs to be made which study appraisal elements, during RoB assessments, should be generic and which need to be specific to the study design . The answer to this question largely depends on the capacity of certain study designs to contribute to the risk assessment and the volume of the accumulated evidence. This is specific to the hypothesis under consideration, and the context in which the studies are conducted. Moving to the customisation of the RoB assessment, the inclusion, elaboration and decisions on how to assess various bias domains need to be decided and tailored. For example, specific concerns related to selection bias may have to be addressed as to whether there is bias arising from poor response rate or loss to follow‐up. Other points to reflect on may include what issues should be considered when assessing confounder control. For interventional studies, a specific mention should be made on the feasibility of blinding and what should be considered as realistic compliance on case‐by‐case basis. Exposure assessment is another domain where a priori customisation of the study appraisal process for all study designs is warranted for almost every risk assessment. Methodologically related to information bias, there are particular features within each type of exposure assessment that need to be addressed. For example, certain analytical techniques may serve as gold standards and thus a greater weight may be put on the studies that use them, while evolution of the exposure assessment methodology over time is also something to consider. Although giving priority to different methods for exposure assessment or favouring certain methods is both logical and common, it is important to remember that no method is perfect and different methods may provide complementary information that strengthen each other (see Table  1 , Section  4.2.1.2 ). Similar to the exposure assessment, the endpoint/outcome ascertainment is a feature that should receive the same attention. Criteria for inclusion/exclusion of human studies should be defined a priori based on either the APRIO (EFSA Scientific Committee,  2023a ) or the PICO‐PECO/PO/PIT approach. Exclusion of studies solely based on design, population characteristics, or endpoint attributes should not be done. These study characteristics reflect validity aspects (both internal and external) that should be systematically addressed later in the assessment and will finally inform the ‘WoE exercise’. Conversely, exclusion criteria when assessing eligibility should be focused on the outcome assessment and on the intervention characteristics (for clinical trials) or the exposure assessment/characterisation (for observational studies). Rejections based on those considerations should be based on established scientific knowledge allowing a firm conclusion that the study methodology is not appropriate. The approach described above can be modified in data‐rich situations, where it is known a priori that for certain health outcome categories there is an abundance of large studies with methodological characteristics related to generally lower RoB (e.g. randomised controlled experimental studies, prospective cohort studies) that can directly address the assessment question. An example of such a case is the 2022 NDA opinion on tolerable upper intake for dietary sugars (EFSA NDA Panel,  2022 ), where it was already known at the planning stage that there was a large number (e.g. hundreds) of high‐quality prospective cohort and intervention studies, that adequately addressed the key assessment questions. There, retrospective case–control studies, cross‐sectional studies, ecological studies and case studies/series were captured but not prioritised first. Throughout the assessment, this volume of evidence was recorded and could have been used by the NDA Panel, should the need for further consideration had arisen. When screening for eligibility, special attention should be given to studies suspected of violating basic ethical standards. These may include, for example, older studies performed in mental health or correctional institutions where there is an indication that the informed consent procedure was inappropriate (Newcomer et al.,  1999 ; Sheppard et al.,  2020 ). Inclusion of such studies should be carefully considered. Eligibility screening of studies should be performed in a transparent manner and documented accordingly. It is recommended to be done in duplicate by two independent parties and to follow a tiered approach where titles/abstracts are to be assessed first followed by the full text scrutiny. Disagreements between the involved assessors can be resolved by consensus or by a third arbitrator. The use of relevant software with the capacity of providing a ‘rejection log’ along with brief reasons for exclusion is strongly recommended. Moreover, a summary of the excluded body of evidence is recommended pertaining to the cumulative size of the excluded evidence both in terms of the number of studies and the sample size. Based on this brief report, a proposal can be made about whether the exclusion of this part of the evidence may pose a threat to the relevance, validity and generalisability of the scientific assessment. The eligibility criteria, rejection log and summary of excluded evidence should be accessible throughout the entire risk assessment process. Subsequently, included evidence will be organised within different priority tiers as described in the following sections. KEY POINTS Define inclusion and exclusion criteria a priori based on Terms of Reference . Record rejections. As a general rule, do not exclude evidence based on study design and sample size. Exceptional deviations from this approach should be well justified. Define inclusion and exclusion criteria a priori based on Terms of Reference . Record rejections. As a general rule, do not exclude evidence based on study design and sample size. Exceptional deviations from this approach should be well justified. Define inclusion and exclusion criteria a priori based on Terms of Reference . Record rejections. As a general rule, do not exclude evidence based on study design and sample size. Exceptional deviations from this approach should be well justified. Once the eligibility criteria are set, no further exclusions should be made during the assessment. It is recommended to extract data related to the study design and the pertinent health outcome categories for the eligible studies at the eligibility screening level. If the health outcome category is a disease, ideally the proposed health outcome categories should follow the International Statistical Classification of Diseases and Related Health Problems (ICD) classification 30 along with other related endpoints (e.g. biomarkers of effect or functional outcomes), with related endpoints included under the same health outcome category. Furthermore, alignment, to the extent possible, with comparable health outcomes from experimental animal data should be made to facilitate better integration later. For example, lung (function) studies in animals can be linked to human epidemiological studies assessing pulmonary diseases and markers thereof. This process can be particularly challenging as regards not only differences in dose/exposure but also the validity and the relevance of certain animal data to human morbidity. Similarly, a lack of comparable measures between human observations and animal data may complicate comparisons across human and animal health outcome categories, where detailed histopathology and organ weights may be available from animal studies that cannot be measured in human settings. In such cases, the construction of an inventory of health outcome categories at the very beginning of the process may be useful where human and animal health outcome categories are displayed, and their similarities and differences are discussed (EFSA CEP Panel,  2023 ). Concerning description of studies, the details of the data extraction process need to be considered in the context of the relevance and number of the available studies, keeping in mind that the purpose of data extraction is to help assess and summarise the evidence in a systematic and concise manner so that the end‐users of the Opinion can understand why certain decisions were reached. If the number of studies is small, more details can be justified. When the number of studies is large, decisions are made based on more extensive information and then concise reporting is needed that summarises the 'overall picture'. If the data extraction process is outsourced, proper communication and preparation are needed between those doing the data extraction and the experts who will use the data. The final stage of this communication and preparation stage should include the list with the data extraction items, the data characteristics of the items to be extracted (e.g. string or numerical data), as well as the output of a piloting exercise performed by the WG/Panel members that will be used as a reference by the contractor. It is recommended that a literature review software is used for this purpose, allowing for comparative assessments across exposure categories and/or health outcome categories. Moreover, summary master files generated by standard systematic review software can facilitate a meta‐analysis, if appropriate. As an output of the data extraction, the recommended structure of the relevant section is as follows: (a) evidence base overview, (b) a short text description of the prioritised studies, (c) descriptive plots or summary tables including study characteristics and study results, (d) evidence synthesis (if appropriate), (e) hazard identification summary. As regards the evidence overview, it is recommended that the overview starts with a small 'setting the scene' paragraph reporting on the number of studies, the cumulative sample size, the median and range of study sample sizes and interquartile range (IQR), the available study designs and proportions thereof, the countries/populations of origin of the studies and proportions thereof, followed by the APRIO attributes [the population characteristics of the included studies (based on gender, age groups, risk factors), the exposure assessment methods/matrices and the range of the exposure levels, the endpoints under study and proportions thereof]. For example, in the recent update of the risk assessment of hexabromocyclododecanes (HBCDD) in food (EFSA CONTAM Panel,  2021 ), an evidence base consisting of 13 studies was formed across diverse endpoints. A short summary of the evidence base related to this risk assessment was as follows: The evidence base includes one cohort study, one birth cohort study (reported in four publications) and 11 cross‐sectional studies where the HBCDD exposure was assessed simultaneously or even later than the endpoint ascertainment. The sample size of the included observational studies ranged from 34 to 71,415 participants. All the evaluated populations came from European countries except for five cross‐sectional studies in which populations from the USA (n = 2), China, South Korea and Tanzania were investigated. The populations under study were diverse. Four studies recruited younger children or adolescents, while the remaining studies assessed adult female (n = 6), male (n = 2) or mixed (n = 1) populations. HBCDD exposure was assessed via serum biomarkers (n = 9), biomarkers in breast milk (n = 1), biomarkers in adipose tissue (n = 1), HBCDD measurements in dust (n = 1), or through merging dietary patterns and the presence of HBCDDs in food samples (n = 1). Birth weight/length, neurodevelopment and thyroid dysfunction were the three endpoint categories assessed in children. Subfertility, type 2 diabetes, thyroid hormone levels, severe endometriosis and ovarian endometrioma and breast cancer metastasis were the endpoints assessed in the adult populations . As far as the tables including study characteristics and study results are concerned and due to the complexity of the design and analysis of epidemiological studies, the full panel of results cannot always be tabulated. Keeping in mind that the full data extraction dataset can be available as a supplementary file, the emphasis in the main text should be put to the prioritised endpoints and exposure categories (Table  3 ). Example of study reporting table (EFSA CONTAM Panel,  2024 ). von Ehrenstein et al. (2007) India Prospective mother–child cohort study Wechsler Intelligence Scale for Children (no edition provided), Raven Colored, Progressive Matrices test, Total Sentence Recall test, Purdue pegboard test 351 5–15 years w‐As (μg/L) Mean (SD) Peak lifetime 147 (322) During pregnancy 110 (243) Tertiles  100  100 B (95% CI) Full scale IQ Peak lifetime Ref 0.006 (−0.031, 0.33) −0.16 (−0.56, 0.23) −0.06 (−0.30, 0.18) During pregnancy Ref −0.047 (−0.38, 0.28) −0.007 (−0.36, 0.34) −0.002 (−0.24, 0.24) KEY POINTS Define health outcome categories based on the current ICD along with other related endpoints (e.g. biomarkers of effect or functional outcomes (e.g. impaired cognition/reduced IQ, impaired growth)). Alignment of health outcome categories with animal data and organising evidence by health outcome categories. Extract data related to health outcome categories and study design, using literature review software as needed. Create the evidence overview, using evidence mapping tools where needed. Report the evidence base in a structured manner, including a short summary of the evidence base, followed by a description of the eligible studies, summary tables, and evidence synthesis and hazard identification. Define health outcome categories based on the current ICD along with other related endpoints (e.g. biomarkers of effect or functional outcomes (e.g. impaired cognition/reduced IQ, impaired growth)). Alignment of health outcome categories with animal data and organising evidence by health outcome categories. Extract data related to health outcome categories and study design, using literature review software as needed. Create the evidence overview, using evidence mapping tools where needed. Report the evidence base in a structured manner, including a short summary of the evidence base, followed by a description of the eligible studies, summary tables, and evidence synthesis and hazard identification. Define health outcome categories based on the current ICD along with other related endpoints (e.g. biomarkers of effect or functional outcomes (e.g. impaired cognition/reduced IQ, impaired growth)). Alignment of health outcome categories with animal data and organising evidence by health outcome categories. Extract data related to health outcome categories and study design, using literature review software as needed. Create the evidence overview, using evidence mapping tools where needed. Report the evidence base in a structured manner, including a short summary of the evidence base, followed by a description of the eligible studies, summary tables, and evidence synthesis and hazard identification. After asserting study eligibility, structuring the evidence base and data extraction, the next step would be the assessment of individual studies so that they can be integrated through the WoE approach. Before embarking on that step of the hazard identification, several aspects on the ability of each study to answer the assessment question need to be considered. They include: the internal validity of the available studies (i.e. RoB and its likely impact on the effect estimates). relevance of the study design (different interventions and observational designs) and their specific design characteristics for answering the research/assessment question. other factors such as characteristics of the recruited population including age and underlying health that may impact the external validity of the study. The relevance of these considerations is highly specific to the assessment question. the internal validity of the available studies (i.e. RoB and its likely impact on the effect estimates). relevance of the study design (different interventions and observational designs) and their specific design characteristics for answering the research/assessment question. other factors such as characteristics of the recruited population including age and underlying health that may impact the external validity of the study. The relevance of these considerations is highly specific to the assessment question. The internal validity of a study , or RoB assessed though structured appraisal tools (see Section  4.4.1.5.1 ), is one (albeit not the sole) of the core attributes that will determine how an individual study is integrated into the assessment of the totality of the evidence. When the RoB evaluation is used to assign studies to different tiers (e.g. 1, 2, or 3), such tiering can be used as a prior for assigning weights to individual studies. As the RoB assessment may not address the direction or magnitude of potential biases or the appropriateness, weighing the evidence cannot be performed relying only on the RoB assessment. For this reason, exclusion of studies based on their tiering during hazard identification is not recommended. Such a decision can only be justified if the RoB evaluation necessitates a reconsideration of the initial eligibility criteria or the assessment question (or both). This recommendation is in line with most guidance documents on evidence synthesis. For example, in the NTP‐OHAT handbook on systematic reviews it is specifically stated that: 'The tiering approach outlined by OHAT favors inclusion of studies unless they are problematic in multiple key aspects of study quality , an approach that offsets concern about potentially excluding studies based on a single measure, which could seriously limit the evidence base available for an evaluation'. As formulated, the term 'multiple key aspects' would not justify the exclusion of a study due to strong concerns for bias for one or two bias domains, which commonly leads to 'Tier 3' allocation. Moreover, as mentioned, individual studies will always have different uncertainties and biases. By evaluating the totality of the evidence during hazard identification, it can be determined whether the combination of individual studies can overcome possible biases identified in individual studies, thereby providing a more robust assessment (see also Section  4.4.1.7 ). There are several key considerations on specific design characteristics that should be considered along with the RoB assessment. Here the assessor needs to assess to what degree the study design and conduct is appropriate for providing meaningful input for answering the assessment question. These considerations include: Sample size. A reasonable question to ask for each study is if the sample size is sufficient to allow for the detection of an association or a relevant effect size. Of note, post hoc power calculations may not be of further help in such assessment (Heinsberg & Weeks,  2022 ) as the uncertainty around a significant or non‐significant effect estimate is already quantified by the confidence interval. Study duration and temporal separation between exposure and outcome: These issues are not explicitly included in many RoB tools (National Toxicology Program,  2019 ), although in some of EFSA's risk assessments the question on outcome has been customised for NTP‐OHAT to include temporal separation (EFSA CEP Panel,  2023 ). Intervention studies of too short duration or observational studies with too short follow‐up are likely to be biased towards the null (Falkingham et al.,  2010 ), regardless of how well they are conducted. Sample size. A reasonable question to ask for each study is if the sample size is sufficient to allow for the detection of an association or a relevant effect size. Of note, post hoc power calculations may not be of further help in such assessment (Heinsberg & Weeks,  2022 ) as the uncertainty around a significant or non‐significant effect estimate is already quantified by the confidence interval. Study duration and temporal separation between exposure and outcome: These issues are not explicitly included in many RoB tools (National Toxicology Program,  2019 ), although in some of EFSA's risk assessments the question on outcome has been customised for NTP‐OHAT to include temporal separation (EFSA CEP Panel,  2023 ). Intervention studies of too short duration or observational studies with too short follow‐up are likely to be biased towards the null (Falkingham et al.,  2010 ), regardless of how well they are conducted. To give a perspective on the points above, a study can be well conducted in terms of scoring low on RoB (relatively high internal validity) but can at the same time be close to meaningless for answering the question it aimed to address (low relevance). As an example, in EFSA's re‐evaluation of the non‐nutritive sweetener thaumatin, several intervention studies were identified examining if oral intake of this protein might lead to allergic responses, using the skin prick test (EFSA FAF Panel,  2021 ). Some of those studies recruited very few participants (e.g. indicatively, n  < 20). With the prevalence of common food allergies generally being well below 10%, no meaningful conclusion (or even an effect estimate) could be derived from such small studies. Similarly, an intervention study of only 2‐week duration aimed at examining the effect of a certain diet on blood lipids would provide limited information when taking into consideration that lipid‐lowering drugs achieve maximal benefits only after several (> 4) weeks of treatment (Gencer & Giugliano,  2020 ). The main take home message here is that a study can be well conducted in terms of RoB but at the same time may be poorly suited by design to answer the assessment question. The relevance of study design was largely covered in our discussion on experimental studies (Section  4.1.2.1 ), observational studies (Section  4.1.2.2 ) and their strengths and limitations (Sections  4.1.5.1 and 4.1.5.2 ). The different sources of bias of different study designs were also highlighted in Table  2 , which provides some background to their strengths and limitations, which need to be assessed relative to the exposure and health outcome being addressed. EFSA's assessment of the tolerable upper intake level (abbreviation: UL (upper level)) for selenium provides an example of the potential relevance of cross‐sectional studies in risk assessment, among available streams of evidence (i.e. mechanistic and animal data, and other experimental and observational studies in humans) (EFSA NDA Panel,  2023 ). Cross‐sectional studies have been carried out in seleniferous areas, for instance of China, India, and South and North America, and in such cases, it is reasonable to assume that current measured exposure should reflect long‐term exposure among those who have been living in such areas characterised by an exceptionally high selenium content in soil and drinking water, particularly if residents' diet largely depends on locally produced foods. This makes a stronger case for causality compared to cross sectional studies examining exposure that are less likely to be stable over time. For this reason, the cross‐sectional design is generally considered valid when investigating settings such as the aforementioned ones for selenosis. Conversely, there are several instances in which cross‐sectional (as well as some case–control) studies are subject to substantial RoB, and therefore should be considered with caution, and their exclusion in the evidence integration may be justified. This is the case, for instance, for some nutrients, such as sodium and potassium, the consumption of which is modified by early disease symptoms following dietary advice and/or metabolic and nutritional alterations. Similarly, RCTs focusing on complex dietary changes for outcomes with long latency periods like cancer are not necessarily more informative than well designed prospective studies (Hébert et al.,  2016 ). Also, the assumption that ecological studies provide no or limited information on cause and effect can be questioned (Li et al.,  2020 ). The reason for re‐highlighting these examples from previous sections is to emphasise that different study designs can be variably useful in risk assessments and that rules of thumb on study designs serve educational purposes rather than actual risk assessment endeavours. In summary, RoB assessment alone is not sufficient for determining the weight assigned to a given study prior to integrating the evidence. It is the combined assessment of the study type, its RoB and specific design attributes that should determine the weight given to a study when performing the WoE assessment. KEY POINTS Risk of bias assessment is only one of several aspects that need to be considered when assessing relevance and reliability of individual studies. Key study characteristics such as timing between exposure and outcome and sample size must also be thoroughly assessed. The relevance of a given study design for a specific assessment needs to be considered on a case‐by‐case basis. Exclusion of studies based on their assigned tier is generally discouraged when assessing epidemiological evidence. Risk of bias assessment is only one of several aspects that need to be considered when assessing relevance and reliability of individual studies. Key study characteristics such as timing between exposure and outcome and sample size must also be thoroughly assessed. The relevance of a given study design for a specific assessment needs to be considered on a case‐by‐case basis. Exclusion of studies based on their assigned tier is generally discouraged when assessing epidemiological evidence. Risk of bias assessment is only one of several aspects that need to be considered when assessing relevance and reliability of individual studies. Key study characteristics such as timing between exposure and outcome and sample size must also be thoroughly assessed. The relevance of a given study design for a specific assessment needs to be considered on a case‐by‐case basis. Exclusion of studies based on their assigned tier is generally discouraged when assessing epidemiological evidence. Critical appraisal tools provide a structured and transparent approach to assess the risk of systematic biases that may occur in individual studies (e.g. internal validity). In terms of use, it is important to take into consideration that many RoB tools were initially designed to assess RCTs. Some were later adapted, and new tools have been developed to address observational designs. Given the variety of observational designs and shorter history of use of RoB tools for such designs, some customisation (or tailoring) is often needed. When used on the basis of those principles, RoB tools provide a transparent structure for considering different types of bias, which is an important information for further evidence synthesis and assessment of uncertainty. Another point to consider is that, for a particular hypothesis, the bias domains to focus on when using RoB tools may differ. For example, if environmental exposure to a chemical is hypothesised to cause lung cancer, then confounding by smoking may be of concern. Occupational cohort studies, for example using past employment records usually do not include smoking data, whereas case–control studies conducted in the general population usually include this – if smoking is not a strong confounder in the case–control studies, it is also not likely to be a confounder in cohort studies conducted in comparable populations. On the other hand, if chemical exposure occurs occupationally, then confounding is unlikely to be important (there may be little or no confounding in comparisons of different groups of manual workers 31 ), but other potential biases (e.g. the healthy worker effect) may be of more concern. Thus, the outcome of the RoB assessment should focus on the bias domains that are expected to have the highest influence on the uncertainty related to the assessment question. Regardless of study design, the domains covered by RoB tools include selection bias , confounding and information bias (see definitions in Sections  4.2.1.4 and 4.2.1.4.3 ). 32 When applying RoB tools to individual studies understanding the different processes that may lead to different biases, their complexity and knowing what to look for is fundamental for facilitating proper use. Below a short comparison for the three main types of biases that may occur in experimental and non‐experimental designs are given: Selection bias: For RCTs , this relates to the appropriate randomisation of study subjects, including both the process of allocation and the procedures to conceal it. These factors are relatively straightforward to assess if the baseline characteristics across groups and the method of randomisation are properly described. If not, considerations around reported balance of the treatment groups at baseline provide important information on possible RoB. 33 For observational studies, selection bias relates to the procedures used to select study participants. This can be difficult or impossible to evaluate as information on whether study participants are systematically different from those who were eligible (but not recruited) is usually missing. Selection bias can also occur in both experimental and observational studies if participants are selectively lost to follow‐up during the study period. In principle, the risk of such bias can be assessed by comparing the baseline characteristics of those lost to follow‐up vs those who were not. Confounding: Assuming that the randomisation process is appropriate and study size is adequate, bias due to confounding in RCTs may still occur if there are deviations from intended treatment . 34 Such bias may occur if participants or investigators are not blinded to treatment allocation. The intention to blind participants and investigators can easily be evaluated by study reporting, but how influential blinding is in terms of avoiding differential treatment is more difficult to evaluate. For observational studies assessing confounding is even more complex as the exposure is rarely randomly allocated by nature. As a result, confounding has to be controlled for. Even if known confounders are accounted for, confounding due to unidentified factors or improper confounder control in the analysis (‘residual confounding’) can never be fully excluded – although it may be possible to estimate its likely strength and direction (and in some instances, residual confounding may be very small). Compared to RCTs, where some inferences on differential treatment can be made regardless of the research question, assessing bias due to confounding in the observational study setting is most often study specific and requires expert judgement and experience on a case‐by‐case basis (e.g. what are the likely sources of confounding for this particular setting). Information bias: Information bias involves misclassification of the study participants with respect to exposure, outcome or confounder status. For RCTs and observational studies, there are no differences in terms of how outcome misclassification may occur or how such biases are assessed. Problems with exposure and confounder misclassifications are however more specific to observational designs, as both exposure and relevant confounders need to be assessed (quantified) as opposed to being largely taken care of by design in RCTs. Reporting bias is also one form of information bias that is commonly assessed in RoB tools. Standards exist of both designs, but since RCTs are generally conducted to test one or very few hypotheses, selective reporting is often easier to identify compared to observational studies that can be of more explorative nature. Selection bias: For RCTs , this relates to the appropriate randomisation of study subjects, including both the process of allocation and the procedures to conceal it. These factors are relatively straightforward to assess if the baseline characteristics across groups and the method of randomisation are properly described. If not, considerations around reported balance of the treatment groups at baseline provide important information on possible RoB. 33 For observational studies, selection bias relates to the procedures used to select study participants. This can be difficult or impossible to evaluate as information on whether study participants are systematically different from those who were eligible (but not recruited) is usually missing. Selection bias can also occur in both experimental and observational studies if participants are selectively lost to follow‐up during the study period. In principle, the risk of such bias can be assessed by comparing the baseline characteristics of those lost to follow‐up vs those who were not. Confounding: Assuming that the randomisation process is appropriate and study size is adequate, bias due to confounding in RCTs may still occur if there are deviations from intended treatment . 34 Such bias may occur if participants or investigators are not blinded to treatment allocation. The intention to blind participants and investigators can easily be evaluated by study reporting, but how influential blinding is in terms of avoiding differential treatment is more difficult to evaluate. For observational studies assessing confounding is even more complex as the exposure is rarely randomly allocated by nature. As a result, confounding has to be controlled for. Even if known confounders are accounted for, confounding due to unidentified factors or improper confounder control in the analysis (‘residual confounding’) can never be fully excluded – although it may be possible to estimate its likely strength and direction (and in some instances, residual confounding may be very small). Compared to RCTs, where some inferences on differential treatment can be made regardless of the research question, assessing bias due to confounding in the observational study setting is most often study specific and requires expert judgement and experience on a case‐by‐case basis (e.g. what are the likely sources of confounding for this particular setting). Information bias: Information bias involves misclassification of the study participants with respect to exposure, outcome or confounder status. For RCTs and observational studies, there are no differences in terms of how outcome misclassification may occur or how such biases are assessed. Problems with exposure and confounder misclassifications are however more specific to observational designs, as both exposure and relevant confounders need to be assessed (quantified) as opposed to being largely taken care of by design in RCTs. Reporting bias is also one form of information bias that is commonly assessed in RoB tools. Standards exist of both designs, but since RCTs are generally conducted to test one or very few hypotheses, selective reporting is often easier to identify compared to observational studies that can be of more explorative nature. When using RoB tools it is important to note that all existing appraisal approaches have their strengths and limitations (Bero et al.,  2018 ). Ideally, different instruments should lead to the same conclusion when applied to the same study. However, very comprehensive tools may indirectly lead to too much focus on minor issues. On the other hand, simple tools may lead to important aspects being overlooked in some cases. Different RoB tools may also use different formulations for questions aimed at assessing the same types of biases. To give an example of differences in formulations across RoB tools for human studies we used selection bias as an example. To demonstrate that, as an example, we compare the formulation in the NTP‐OHAT 35 and USDA Nutrition Evidence Systematic Review 36 ( NESR ) RoB tools for observational designs: The NTP‐OHAT tool asks a single question: 'Did selection of study participants result in appropriate comparison groups?' The USDA‐NESR asks: 'Was selection of participants into the study (or into the analysis) based on participant characteristics observed after the start of exposure?' Based on the answer this question (yes/no), several sub‐questions follow, including if post‐exposure variables may have influenced exposure and outcome. The NTP‐OHAT tool asks a single question: 'Did selection of study participants result in appropriate comparison groups?' The USDA‐NESR asks: 'Was selection of participants into the study (or into the analysis) based on participant characteristics observed after the start of exposure?' Based on the answer this question (yes/no), several sub‐questions follow, including if post‐exposure variables may have influenced exposure and outcome. A similar formulation as used in the USDA‐NESR is also used in the Robins‐I tool for non‐randomised interventions (Sterne et al.,  2016 ). On the other hand, the Newcastle‐Ottawa Scale for assessing the quality of non‐randomised studies again uses a slightly different formulation and different scales for cohort and for case–controls studies (unlike NTP‐OHAT and USDA‐NESR). The purpose of this example is to highlight that formulations used to capture possible selection bias can be quite different. Without further consideration, this may lead to different conclusions if the focus of the assessment is on the exact wording of individual questions (the bias that should be captured is the same, independent of how the question is asked). For human RCTs more consistent formulations are generally in use. As an example, the Cochrane risk of bias tool (RoB 2.0) asks: 'Was the allocation sequence (1) random (2) and concealed until participants were enrolled and assigned to interventions; and (3) did baseline differences between intervention groups suggest a problem with the randomisation process?' Similarly, the NTP‐OHAT asks: '(1) was administered dose or exposure level adequately randomised, (2) was allocation to study groups adequately concealed; and (3) did selection of study participants result in appropriate comparison groups?' the Cochrane risk of bias tool (RoB 2.0) asks: 'Was the allocation sequence (1) random (2) and concealed until participants were enrolled and assigned to interventions; and (3) did baseline differences between intervention groups suggest a problem with the randomisation process?' Similarly, the NTP‐OHAT asks: '(1) was administered dose or exposure level adequately randomised, (2) was allocation to study groups adequately concealed; and (3) did selection of study participants result in appropriate comparison groups?' Although slightly different, the two formulations are for all practical purposes identical. The similarity for this RCT example may be due to the longer history of RoB tools for RCTs which has perhaps resulted in better harmonisation. In contrast, there is currently no single standard or consensus about the best approach for assessing RoB in observational studies (Page et al.,  2018 ; Viswanathan et al.,  2013 ). In summary, it is crucial that those using different RoB tools are aware of what types of biases are being captured and what to focus on and look for, being able to understand the issues queried by each of the individual questions , which are formulated to guide the assessor. A simple checklist‐type formulation can never cover all scenarios encountered when appraising different studies, but they do provide a structured approach for assessing biases, which needs to be tailored for each assessment. Finally, to put the content of this section on use of RoB tools in perspective, Appendix  F Appraisal of different studies using a RoB tool contains a series of appraisal examples aimed at demonstrating how appraisal of individual studies using a RoB tool could be performed. For this purpose, the examples cover appraisal of both double blind RCTs and randomised nutritional intervention studies, as well as observational designs (cohort and case control studies) relevant to chemical risk assessment. Each of these examples is aimed at highlighting the principles and considerations that need to be considered when assessing different types of biases for individual studies. The examples are chosen for illustrative purposes only and some of the points made could be subject to a different interpretation. After assessing a study using RoB tools, the reviewers' judgements attached to each question in a given appraisal tool are documented and translated into an overall summary assessment. This could, for example, be in a form of a short text summary grouping of studies according to types of bias that may occur (see section below on evidence synthesis). ranking of studies into tiers (from low to high RoB) or numerical scoring. short text summary grouping of studies according to types of bias that may occur (see section below on evidence synthesis). ranking of studies into tiers (from low to high RoB) or numerical scoring. Numerical scoring here refers to the approach of assigning a numerical value to each RoB question that is then summarised in some way (perhaps using different weights) into an overall score. The use of numerical scores (or scales) for assessing quality or RoB is currently explicitly discouraged by the Cochrane handbook (Higgins & Green,  2011 ). Despite their proffered convenience and simplicity, scaling systems rely on weight assignment on different items of the scale. Such an approach bears three major limitations: it is difficult to replicate, it is not transparent to the final user of the risk assessment, and it does not accurately reflect study validity (Emerson et al.,  1990 ; Jüni et al.,  1999 ; Schulz et al.,  1995 ). Ranking of quality of the evidence of a study into tiers is a better alternative to numerical scoring, as it is more transparent because it relies on fewer well‐defined attributes (bias questions) that determine the overall summary assessment. Summary assessments by ranking (into tiers) allow not only for an overview of the evidence within each tier but also for a structured appraisal of the whole body of evidence. Such a summary can be done in the form of heat maps. One potential problem when relying on summary assessments by ranking into tiers is that it may obscure the fact that judgements on the overall body of evidence should always consider the type, and possible direction and magnitude of potential biases identified across different studies. Even though it is often difficult to assess such parameters, it is important. Summarising RoB by tiers tends to hide these important attributes. In cases where most studies suffer from the same type of bias (including possible direction), assessing the overall body of evidence by looking at individual tiers from each study is more justified. In other cases, the type and direction of biases must be assessed in parallel. For example, suppose that several studies of the same design have been rated as having either moderate or high RoB, but all the studies consistently show the same association of exposure and health outcome. Based on simple tiering (or scoring), some assessors may conclude, when weighing the evidence, that the quality is low and that limited conclusions on causality can be made due to the present RoB. However, a more careful inspection may reveal that there are different sources of biases across these studies, with some scoring low on selection bias, but high on other aspects, while other studies scoring low on confounding or information bias score high on other aspects. Further evaluation may then reveal that the direction of these potential biases across studies is likely to be different. In that case, it is highly unlikely that the consistently observed associations are due to these potential biases (since they would work in different directions). Such a scenario is not just hypothetical, and the approach of taking both type and possible direction of bias into account compared to just looking at the RoB scoring can lead to different conclusions. Of course, if the risk of same type of bias would have been present in most or all the studies evaluated and the expected direction is anticipated to be the same, then it is not surprising if the studies produce similar findings – they may all be wrong. A further discussion of these issues is the subject of Section  4.4.1.6 . In summary, the considerations above once again highlight the importance of assessing both the magnitude (where possible) and the direction of different biases when evaluating individual studies. This should allow for better evidence integration than simply focusing on individual study tiers. As discussed above, the outcome of a RoB assessment is just one of several key aspects that need to be considered in evidence integration. Parallel to wider use and experience gained of applying systematic reviews in evidence synthesis, the limitations of prioritising studies based on simple tiering through RoB assessments is increasingly being recognised (Steenland et al.,  2020 ). This has led to some methodological development on how to make more thorough use of RoB assessment in evidence integration. One approach that has been suggested is causal inference by triangulation. That approach is more in line with the Bradford Hill viewpoints that were intended to aid integration of all available evidence but not to ‘judge’ individual studies. The concept of ‘triangulation’ extends the approach of Bradford Hill, in that it explicitly seeks to consider evidence from different types of studies and/or studies in different contexts, so that the strength and direction of various possible biases can be assessed. To give an example, Pearce et al. ( 1986 ) conducted a case–control study of pesticide exposure and non‐Hodgkin lymphoma, which involved two control groups: (i) a general population control group and (ii) an ‘other cancers’ control group. It was hypothesised that the former control group would produce an upward bias in the estimated odds ratio (differential recall bias if healthy general population controls are less likely to remember previous exposure than the cancer cases), whereas the ‘other cancers’ control group could produce a downward bias in the estimated odds ratio (if any of the other cancers were also caused by the pesticide exposure under study). Both groups yielded similar findings, indicating that neither bias was occurring to any discernible degree. This provided strong evidence that little recall bias (a type of information bias) or selection bias was occurring. However, a simple tiering of this study as a result of bias evaluation likely would have led to low prioritisation as both components of the study might have been considered high RoB (albeit in opposite directions). Triangulation is also consistent with the approach advocated by Savitz et al. ( 2019 ) who argue that RoB assessments should focus on identifying a small number of the most likely influential sources of bias, classifying each study on how effectively it has addressed each of these potential biases (or was likely to have the bias) and determining whether results differ across studies in relation to susceptibility to each hypothesised source of bias. For example, information bias is unlikely to explain positive findings of studies with non‐differential exposure and/or outcome misclassification if stronger findings are found among studies with more accurate assessment. A good example of triangulation by assessing exposure quality can be found in Lenters et al. ( 2011 ) who evaluated the association between asbestos and lung cancer. In this analysis, stratification by exposure assessment characteristics revealed that studies with a well‐documented exposure assessment, larger contrast in exposures, greater coverage of the exposure history by the exposure measurement data, and more complete job histories had higher risk estimates per unit dose than did studies without these characteristics. Similar observations have been made for other important environmental and occupational exposures (Vlaanderen et al.,  2011 ). KEY POINTS An alternative to study tiering is to use the outcome of the risk of bias evaluation for causal inference by triangulation. This may allow for more thorough evaluation of biases and their possible consequences for the risk assessment rather than just focusing on the presence of possible biases in individual studies. An alternative to study tiering is to use the outcome of the risk of bias evaluation for causal inference by triangulation. This may allow for more thorough evaluation of biases and their possible consequences for the risk assessment rather than just focusing on the presence of possible biases in individual studies. An alternative to study tiering is to use the outcome of the risk of bias evaluation for causal inference by triangulation. This may allow for more thorough evaluation of biases and their possible consequences for the risk assessment rather than just focusing on the presence of possible biases in individual studies. The aim of this section is to provide specific guidance on how to assemble and integrate evidence from human studies in line with EFSA's WoE guidance (EFSA Scientific Committee,  2017 ). EFSA's guidance on WoE provides a flexible framework for assembling available information into lines of evidence and weighing them at each step of the integration process. The NTP‐OHAT guidance on systematic reviews (National Toxicology Program,  2019 ) and GRADE (Morgan et al.,  2019 ) can be considered one of several possible implementations under that framework. A modified illustration from the WoE guidance is shown in Figure  2 . The figure shows how all evidence (human, animal, in vitro and in silico) can be assembled and integrated. General framework for hazard assessment, assembling and integrating different lines of evidence. Adapted from the EFSA guidance on WoE (EFSA Scientific Committee,  2017 ). The yellow bars indicate human lines of evidence. Although human studies are shown as one line of evidence in Figure  2 , this line would reflect integration of several lines of human evidence covering all studies meeting the inclusion criteria (of varying design addressing different exposure and outcome measures in different populations). In some cases, e.g. within the area of nutrition, the evidence base can be quite large with many observational and intervention studies being available for each health outcome category (EFSA NDA Panel,  2019 , 2022 ). In other areas, such as for certain contaminants or food additives, much fewer studies are usually available (EFSA FAF Panel,  2021 ; EFSA Scientific Committee,  2023b ). Ideally, the evidence should be assembled around a specific health outcome category. Within each health outcome category, separate lines of evidence can be established, taking into consideration: Different study population subgroups, such as children, pregnant and lactating women, adults, and the elderly Timing of exposure Representativeness of the study population for the population of concern Different study population subgroups, such as children, pregnant and lactating women, adults, and the elderly Timing of exposure Representativeness of the study population for the population of concern The reason for considering these factors when assigning studies to different lines of evidence is usually expected differences in sensitivity or susceptibility with respect to exposure. This could, for example be due to differences in age, health and underlying nutritional status and, when relevant, genetic background (see Figure  3 ). Different sub‐categories within a given health outcome category constructed based on timing of exposure or population under study (*within each sub‐category, different study designs can be assembled as shown in Figure  4 ). Within each sub‐category available studies can be grouped according to their design (see Figure  4 ), as this allows for integration across studies with similar sources of design‐related attributes and potential biases (see Table  2 ). Deviations from this approach can be taken on a case‐by‐case basis and this may, for example, be appropriate when using the method of triangulation discussed above. General framework for assembling and weighing epidemiological evidence for a specific health outcome category. Concerning Figure  4 , more weight should be given to studies that by design are better suited to answer the assessment question. A priori this does not mean that RCTs should receive the highest weight according to some pre‐defined evidence pyramid (Pandis,  2011 ). For example, if the outcome under consideration has a long latency period such as CVD, well conducted cohort studies could be prioritised. RCTs assessing intermediate CVD risk factors could then be used as supportive line of evidence when making judgement on causality. In other cases, if several well‐conducted and sufficiently powered RCTs are available that can address the assessment question, such evidence would take precedence over observational studies that would then provide supporting evidence. In short, the study design (different interventional or observational) should not by default determine the weight assigned to each study without careful consideration on its suitability to answer the assessment question. Based on the assembling studies as suggested in Figures  3 , 4 , integration across studies can be performed by assigning weights or confidence levels to individual studies based on the study design, other design specific considerations, the RoB evaluation and other relevant factors identified. Then the evidence can be integrated across all studies (Figure  4 ) within each sub‐category (Figure  3 ). Although in principle the weights or confidence levels assigned to individual studies could be both qualitative or quantitative, a common approach is to provide a short‐written argument explaining how weight/confidence is assigned, taking all the above‐mentioned factors into consideration. In other cases, such as the NTP‐OHAT guidance on systematic reviews, 37 confidence levels assigned to individual studies are already predefined based on study design (very low to high initial confidence). Using those levels as starting point, the evidence integration then takes into consideration the RoB evaluation and other factors that may result in upgrading or downgrading of the body of evidence (see the NTP‐OHAT guidance 38 and, for a practical example, the re‐evaluation of erythritol (E 968) as a food additive (EFSA FAF Panel,  2023 )). KEY POINTS Evidence should preferably be assembled around specific health outcome categories. Within each health outcome category, separate lines of evidence can be established, taking into consideration different study population subgroups with known (or assumed) differences in sensitivity or susceptibility with respect to exposure. If the number of available studies allows, grouping studies by design is one option for structured evidence integration. This allows for weighing the combined evidence taking the complimentary strength and weaknesses of each design into consideration. Evidence should preferably be assembled around specific health outcome categories. Within each health outcome category, separate lines of evidence can be established, taking into consideration different study population subgroups with known (or assumed) differences in sensitivity or susceptibility with respect to exposure. If the number of available studies allows, grouping studies by design is one option for structured evidence integration. This allows for weighing the combined evidence taking the complimentary strength and weaknesses of each design into consideration. Evidence should preferably be assembled around specific health outcome categories. Within each health outcome category, separate lines of evidence can be established, taking into consideration different study population subgroups with known (or assumed) differences in sensitivity or susceptibility with respect to exposure. If the number of available studies allows, grouping studies by design is one option for structured evidence integration. This allows for weighing the combined evidence taking the complimentary strength and weaknesses of each design into consideration. It is important to note that the grouping of studies as suggested in in Figures  3 , 4 may not be feasible when few studies are available. Formal procedures for WoE (Higgins et al.,  2023 ; National Toxicology Program,  2019 ), that tend to be time consuming, may also be less relevant when the evidence is small (although in principle they can be applied). When the evidence base is small, a simple narrative description reflecting on the strengths and limitations of the evidence can be more appropriate than a structured approach designed around a large evidence base. The Scientific Committee's opinion on copper provides a good example of how the WoE guidance (EFSA Scientific Committee,  2017 ) can be applied in a structured but simple manner to an assessment with few available studies (EFSA Scientific Committee,  2023b ). Consistency or inconsistency across studies may be highlighted in the summary of all relevant studies, which may be done narratively or graphically, or more formally through a meta‐analysis. Although each health outcome category is usually constructed around a range of related health outcomes, how broadly each health outcome category is defined depends on the size of the evidence base (i.e. how many studies). For example, in the EFSA opinion on BPA (EFSA CEP Panel,  2023 ), a range of partly unrelated outcomes such as sex ratio, live birth rate, follicular phase length and endometrial wall thickness were all included in one health outcome category named 'female fertility'. The few studies addressing each health outcome justified such grouping. The obvious limitation, however, is that integrating the evidence across a few weakly related outcomes is challenging and may not be substantiated in terms of suspected mode of action. An alternative approach may be equally challenging as the conclusions drawn from single or very few studies are usually not very robust, irrespective of how well they have been conducted. The NDA opinion on sodium provides an example on how to construct different lines of evidence in situations where the evidence base is large (EFSA NDA Panel,  2019 ). In that opinion, a few hundred epidemiological studies addressing cardiovascular health outcomes were identified. In such cases, it is possible to construct different lines of evidence within each health outcome category around related health outcomes. Here the lines of evidence for sodium excretion and (1) raised blood pressure, (2) hypertension and (3) stroke were constructed. As expected, the number of both experimental and observational studies for raised blood pressure was large. Fewer studies were available for hypertension. The number of RCTs and non‐experimental studies on the effects of sodium intake on blood pressure was quite large, which allowed to conduct a dose–response meta‐analysis between 24‐h sodium urinary excretion and both systolic and diastolic blood pressure. Therefore, only interventions examining the effect of sodium reduction on blood pressure were included. For hypertension, fewer intervention studies were available than observational studies. For stroke or coronary heart disease (CHD), only three observational cohort studies were available, and the same was true for the risk of overall CVD. In such cases, conclusion from clinical markers, which are intermediate steps or risk factors (e.g. changes in blood pressure), can provide key supporting evidence for the disease endpoint of concern (e.g. stroke 39 ). Similar grouping of related health outcomes into different lines of evidence, each re‐enforcing each other, is strongly encouraged in risk assessment. The same approach could, for example, be made for other related health outcomes such as grouping blood lipids, hypercholesterolaemia and CHD into three related but separate lines of evidence (see Figures  4 and 5 ), provided that they influence the health outcome being assessed. For each health outcome, all study designs should preferably be included because the findings of interventions conducted in controlled conditions in healthy volunteers or selected populations may not be very representative compared to the general population or older persons more at risk. Aligning the available human evidence within a given health outcome category into different lines of evidence (LoE) based on individual health outcomes. The key here is to use the available evidence to the extent possible, building a case for or against causality for a disease endpoint with supporting studies examining related clinical markers (or risk factors). Thus, the assembled lines of evidence should also include health outcomes that may ultimately not necessarily be considered for establishing a HBGV. KEY POINTS It is recommended to assemble different lines of evidence within each health outcome category around related health outcomes. Conclusions reached from biomarkers of effect which are intermediate steps or risk factors for the disease endpoint of concern could be used as evidence for or against causality for the disease endpoint of concern, especially where there is limited density of evidence on the endpoint. It is recommended to assemble different lines of evidence within each health outcome category around related health outcomes. Conclusions reached from biomarkers of effect which are intermediate steps or risk factors for the disease endpoint of concern could be used as evidence for or against causality for the disease endpoint of concern, especially where there is limited density of evidence on the endpoint. It is recommended to assemble different lines of evidence within each health outcome category around related health outcomes. Conclusions reached from biomarkers of effect which are intermediate steps or risk factors for the disease endpoint of concern could be used as evidence for or against causality for the disease endpoint of concern, especially where there is limited density of evidence on the endpoint. Simultaneously with the retrieval, appraisal and WoE for human studies, a similar process is carried out for other evidence streams (i.e. toxicological data), see Figure  1 . It is not within the scope here to provide guidance on this process for other evidence streams but it is recommended to align the endpoints with comparable human health outcomes. When integrating human evidence with other evidence streams, information and assessment of mode of action (MoA) provides a structured biologically driven way of integration (see also Section  4.1.5 on cause and effect). Key considerations for a MoA assessment have been described in the EFSA/ECHA guidance for the identification of endocrine disruptors (ECHA and EFSA,  2018 ) and in a joint report by the Committee on Toxicity and Committee on Carcinogenicity in the UK ( 2021 ). These considerations encompass for example: Have substance‐related adverse effects been observed in experimental studies in laboratory animals? Is there sufficient information from those studies to establish a MoA for each Key Event 40 ? Is the relationship between the Key Events in the MoA biologically plausible? This should be assessed based on a broader knowledge of biology. Is it plausible that the effect can occur in humans based on toxicokinetic and toxicodynamic considerations? Does the available evidence support the biological plausibility for the MoA? Here, the evidence must be assessed for dose and temporal concordance. Have substance‐related adverse effects been observed in experimental studies in laboratory animals? Is there sufficient information from those studies to establish a MoA for each Key Event 40 ? Is the relationship between the Key Events in the MoA biologically plausible? This should be assessed based on a broader knowledge of biology. Is it plausible that the effect can occur in humans based on toxicokinetic and toxicodynamic considerations? Does the available evidence support the biological plausibility for the MoA? Here, the evidence must be assessed for dose and temporal concordance. Use of AOPs (either existing or postulated for the purpose of the risk assessment) can be particularly useful, as these provide links between data generated from in vitro/in silico methods, animal models and humans. Furthermore, depending on the maturity of the AOP, it may facilitate quantitative assessment of the relationships between the key events, thereby allowing for predictions on adversity based on in vitro methods measuring key events, including how to integrate exposure considerations by using physiologically based pharmacokinetic (PBPK) models. The guidance for identification of endocrine disruptors also provides guidance on the reporting and examples and how the evidence is mapped corresponding to the level of biological organisation, thus aligning to MoA/AOP frameworks (ECHA and EFSA,  2018 ). Furthermore, capturing the strength of the evidence is also recommended when conducting the WoE assessment. For more detail, see Section  4.4.2.1 on AOP‐based integrated approach to testing and assessment (IATA). Considerations for the integration of MoA can be found in several assessments of pesticide active substances on assessment of identification of endocrine disruptive properties. One relevant example is the assessment of metribuzitan, where it was concluded that the substance had endocrine disruptive properties regarding the thyroid modality, but not for oestrogen, androgen and steroid modalities (EFSA,  2023a ). An example on MoA considerations related to species differences and human relevance can be found in the Opinion on inorganic arsenic (EFSA, 2009 ). Here it was concluded that data from experimental animals could not be used for risk characterisation, because of toxicokinetic differences between humans and animals in their ability to methylate inorganic arsenic and differences in excretion of the metabolites (humans excrete more). In certain situations, experimental animal data are considered but not fully integrated into the risk assessment, e.g. when the human data are abundant and robust, as is often the case for nutrition. In such cases only certain data from laboratory animals may be integrated, such as data on absorption, distribution, metabolism and excretion (ADME). An example of this is the 2015 opinion on caffeine, where rodent studies were not considered in the hazard characterisation (EFSA NDA Panel,  2015 ). Another example is the scientific advice on a tolerable upper intake level for dietary sugars (EFSA NDA Panel,  2022 ). In this case, the large availability of human data about both ADME of dietary sugars and their potential adverse effects, arising from both experimental and observational epidemiological studies, allowed the assessment to be based almost entirely on human evidence. For regulated products and non‐regulated chemicals, differences exist in the availability and nature of data. The initial approval process (pre‐marketing authorisation) of regulated products, such as food additives and pesticides, is based on toxicological experiments in animals and in vitro/in silico data, as determined by the respective data requirements, and, if it exists, on human experimental data (e.g. food additives). However, for later post‐marketing assessments, human observational studies might be available, and these should be taken into consideration. The latter are often not designed to support the authorisation process as they consider post‐marketing observations where the population is not only exposed to the chemical under assessment. For non‐regulated chemicals, all evidence is taken account of, and the fact that available human observational studies as well as laboratory animal studies are rarely designed to directly address the risk assessment questions needs to be considered. One generic approach for the integration of evidence from human studies with other toxicological data for regulated product and non‐regulated chemicals is proposed based on an approach initially developed for regulated products, in this case pesticides (EFSA PPR Panel,  2017 ), see Figure  6 . Approach for systematic integration of epidemiological data with other streams of evidence, see supporting explanations below (*generally, precedence will be given to these studies). After all relevant and reliable epidemiological evidence has been identified, the separate lines of evidence need to be integrated with the other relevant and reliable lines of evidence (laboratory animal, in vitro and in silico data). Available reliable and relevant laboratory animal studies would generally take precedence over in vitro/in silico data unless there is evidence that the laboratory animal model studies do not capture the effect in question. Furthermore, studies compliant with OECD 41 guidelines are by default considered to have a high reliability, unless there is evidence of the contrary. The toxicological data might be corroborated by mechanistic in silico/in vitro data and thus strengthened, whereas a meta‐analysis of the human studies may provide more precise effect estimate for possible health effects. Studies that are found to be more relevant for the health outcome in question are to be given more weight, regardless of whether the data come from human or laboratory animal studies. Where human observational/experimental data are of highest relevance, and they are supported by a mechanistic scientific foundation (see above on MoA considerations), they should take precedence over the experimental animal data. When human and toxicological data are judged to be of similar relevance, it is important to assess their concordance (consistency across the lines of evidence) in order to determine which data set may be given precedence. Concerning human observational data, it is important to stress that a single study in isolation, no matter how well‐conducted, generally only provides part of the evidence, as there are variations in study settings and conduct and possible influence of biases (that can rarely be fully eliminated). A basis for WoE and assessment of concordance ideally requires several studies that vary by design and/or study population/setting so that consistency can be assessed, as pointed out in Section  4.4.1.7 . As discussed above, a robust human experimental database may in some cases be used as a stand‐alone, without a full integration of other streams of evidence. In case of concordance between human and other toxicological data, the risk assessment should use all the data, as both yield similar results in either hazard identification (e.g. both indicate the same hazard) or hazard characterisation (e.g. both suggest similar levels). Thus, both can reinforce each other, and similar mechanisms may be assumed in both cases. In case of non‐concordance and similar reliability/relevance , one needs to take account of this uncertainty. For any non‐concordance, the reason behind the difference should be considered. It is a matter of expert judgement to select the most appropriate study/studies depending on the situation, for example on the data density, dose‐spacing/exposure in different studies, mechanistic understanding (toxicodynamics and toxicokinetics), possible species differences, specific biases in the evidence streams. Regardless of the consistency between experimental animal and human observational data, an assessment of biological plausibility is warranted by including other lines of evidence (mechanistic data and experimental animal data). Challenges are present when results related to specific endpoints are markedly inconsistent between humans and animals (e.g. the association between pesticide exposure and Parkinsonian disorder (EFSA PPR Panel,  2017 )), or when the health outcome associated to exposure has no animal correlate (e.g. the association between exposure to the pyrethroid pesticide, deltamethrin, and autism spectrum disorder and attention deficit hyperactivity disorder (EFSA PPR Panel,  2021 ). In both cases, biological plausibility for the associations found in epidemiological studies was strengthened by means of developing AOPs. Three different approaches for using AOPs may be applied that improve the interpretation of human data by providing a plausible mechanistic link to adverse outcomes supporting the contextualisation in the risk assessment process: Look for an already developed AOP, see OECD AOP wiki. 42 This platform is both open access and is continuously being updated. If a relevant AOP(s) exist and in silico/in vitro/in vivo data exist that supports that the chemical under assessment triggers the AOP at relevant dose/concentration levels, then there is support for biological plausibility for causality. The assessment of endocrine disrupting properties of pesticides and biocides is in practice applying such an approach on a routine basis. Develop an AOP based on retrieved data on a known stressor compound (not the compound under assessment). For example, data (in vitro as well as in vivo) for the well‐known neurotoxic compound rotenone was used to develop the AOP on inhibition of the mitochondrial complex I leading to Parkinsonian motor deficiencies which has been endorsed by OECD (EFSA PPR Panel,  2017 ). Once the AOP has been developed, it can be concluded whether it is biologically plausible that chemicals (in this case certain pesticides) triggering the AOP are plausible risk factors for Parkinson's Disease. Such an approach is resource demanding in terms of time and expertise. However, it may be helpful in the situation where there is human observational evidence of an association while the data from animal experiments does not show an effect, because the animal model is not capturing the effect. Carry out a systematic literature review and critical appraisal of all the evidence on the compound in question (human observational studies, in vivo rodent studies, in vitro data), a quantitative uncertainty analysis of all the evidence using expert knowledge elicitation (EKE) and a probabilistic approach, and finally integrate the data using the AOP conceptual framework. The AOP triggered by a specific compound (stressor‐based AOP) can then support an IATA as exemplified for the developmental neurotoxic effects of deltamethrin (EFSA PPR Panel,  2021 ). Look for an already developed AOP, see OECD AOP wiki. 42 This platform is both open access and is continuously being updated. If a relevant AOP(s) exist and in silico/in vitro/in vivo data exist that supports that the chemical under assessment triggers the AOP at relevant dose/concentration levels, then there is support for biological plausibility for causality. The assessment of endocrine disrupting properties of pesticides and biocides is in practice applying such an approach on a routine basis. Develop an AOP based on retrieved data on a known stressor compound (not the compound under assessment). For example, data (in vitro as well as in vivo) for the well‐known neurotoxic compound rotenone was used to develop the AOP on inhibition of the mitochondrial complex I leading to Parkinsonian motor deficiencies which has been endorsed by OECD (EFSA PPR Panel,  2017 ). Once the AOP has been developed, it can be concluded whether it is biologically plausible that chemicals (in this case certain pesticides) triggering the AOP are plausible risk factors for Parkinson's Disease. Such an approach is resource demanding in terms of time and expertise. However, it may be helpful in the situation where there is human observational evidence of an association while the data from animal experiments does not show an effect, because the animal model is not capturing the effect. Carry out a systematic literature review and critical appraisal of all the evidence on the compound in question (human observational studies, in vivo rodent studies, in vitro data), a quantitative uncertainty analysis of all the evidence using expert knowledge elicitation (EKE) and a probabilistic approach, and finally integrate the data using the AOP conceptual framework. The AOP triggered by a specific compound (stressor‐based AOP) can then support an IATA as exemplified for the developmental neurotoxic effects of deltamethrin (EFSA PPR Panel,  2021 ). A full integration into a risk assessment also requires a careful assessment of the exposure, including detailed information on ADME. This was, for instance, done in the deltamethrin case mentioned above (EFSA PPR Panel,  2021 ). First, the evidence was integrated in an AOP conceptual framework, where a probabilistic quantification of the WoE was conducted to assess and quantify the uncertainty of the evidence. This ultimately allowed the assessment of whether the concentration of the specific compound, which targets the molecular initiating events (MIE) of the AOP, will be relevant or not for its activation. This assessment of the biological plausibility for causality of outcomes observed in epidemiological studies for regulated products could generally be adopted for any risk assessment. However, as for all frameworks, exceptions can arise. In the case of certain food additives, short‐term experimental studies in humans assessing tolerability are available, but long‐term studies assessing chronic disease risk have received much less attention (compared to the pesticide area). Therefore, the approach shown in Figure  6 may apply to food additives in most cases. KEY POINTS Evidence from human epidemiological data should be well‐integrated with findings from other streams of evidence. Structured, biologically driven ways for integrating the data from human observational studies with other data are (1) relating the data to already existing AOP(s) or postulating a new AOP or (2) assessment of Mode of Action. The approach for integrating information from different studies outlined in this guidance provides a structured framework for integrating and for deciding on precedence of different data. Evidence from human epidemiological data should be well‐integrated with findings from other streams of evidence. Structured, biologically driven ways for integrating the data from human observational studies with other data are (1) relating the data to already existing AOP(s) or postulating a new AOP or (2) assessment of Mode of Action. The approach for integrating information from different studies outlined in this guidance provides a structured framework for integrating and for deciding on precedence of different data. Evidence from human epidemiological data should be well‐integrated with findings from other streams of evidence. Structured, biologically driven ways for integrating the data from human observational studies with other data are (1) relating the data to already existing AOP(s) or postulating a new AOP or (2) assessment of Mode of Action. The approach for integrating information from different studies outlined in this guidance provides a structured framework for integrating and for deciding on precedence of different data. In this section, options for dose–response modelling for epidemiological data are addressed. Following a summary of the experience of modelling animal dose–response data using BMD modelling, the overall objectives for identifying reference points (RPs) from which HBGVs can be established are described. Then possible approaches to deriving human benchmark values are discussed. This includes reflections on the choice of benchmark response (BMR), and the application of benchmark approaches to different epidemiological datasets. An overview of alternatives to BMD modelling is also provided: (1) the use of meta‐analysis to derive pooled estimates of dose–response slopes from multiple studies; (2) the specific case of estimating a change in dose–response relationships indicative of a risk threshold; and (3) simple modelling of risk. Finally, the application of uncertainty factors is discussed. To prevent harm from a dietary component, the ideal goal is to establish a HBGV, which is defined as an amount that can be ingested over a defined time‐period without appreciable health risk. Formerly, this was done by extrapolating findings from laboratory animal data using the NOAEL as the RP and dividing by uncertainty factors to account for uncertainty in differences between and within species (see EFSA Scientific Committee,  2012 ). Alternatively, the RP may be divided by the estimated dietary exposure to set a margin of exposure 43 (MOE), (EFSA,  2005 ). For laboratory animal data, EFSA prefers the BMD approach to the NOAEL approach for identifying a RP, partly because it makes better use of the entire data of the dose–response curve compared to simple pairwise comparisons, and the use of the BMDL as a RP includes an adjustment for the uncertainty around the BMD. The methods for applying the BMD approach are described in detail in the EFSA guidance on the use of benchmark dose approach in risk assessment (EFSA Scientific Committee,  2022 ) and a software platform for modelling of controlled animal experiments in line with this guidance has been developed (Hasselt University,  2022 ). The principle involves modelling a number of plausible dose–response curves (e.g. Hill, gamma, exponential, probit, etc.) across the exposure groups, and calculating an exposure (BMD) corresponding to a defined response or BMR. The RP is then identified by taking the lower bound of the model average 95% credible interval of the BMD (the BMDL), which is built considering the posterior weights across all models. Such BMDLs are then calculated for each endpoint measured in the relevant studies. The various BMDLs are assessed in relation to the relevance of the outcome and the quality of the study, to lead to selection of a BMDL to be used as the RP for establishing HBGVs. In this process, both robustness of the underlying study and adversity of the outcome are taken into consideration (EFSA Scientific Committee,  2012 ). EFSA considers that it is not appropriate to establish HBGVs for compounds for which a threshold cannot be assumed, such as substances that are directly acting genotoxic carcinogens. In these circumstances, an MOE is estimated (EFSA,  2005 ). The Scientific Committee has concluded that an MOE of 10,000 or higher, if it is based on the BMDL10 from a laboratory animal study, would be of low concern from a public health point of view (EFSA,  2005 ). The goal of dose–response modelling is to identify an exposure RP, associated with one of three degrees of response: A non‐minimal adverse change 44 over background incidence specified as absolute incidence or relative risk for rare outcomes; and for a continuous endpoint a certain adverse change in the outcome. A threshold of exposure below which there is taken to be no appreciable adverse effect. Such a threshold should be taken into consideration, for example, if there is no increase in incidence of a quantal response, or no change in a continuous measure. Minimisation of risk when there is a U‐shaped relationship between exposure and health effects, as is frequently the case for nutrients for which both deficiency and excess exposure tend to have adverse effects. A non‐minimal adverse change 44 over background incidence specified as absolute incidence or relative risk for rare outcomes; and for a continuous endpoint a certain adverse change in the outcome. A threshold of exposure below which there is taken to be no appreciable adverse effect. Such a threshold should be taken into consideration, for example, if there is no increase in incidence of a quantal response, or no change in a continuous measure. Minimisation of risk when there is a U‐shaped relationship between exposure and health effects, as is frequently the case for nutrients for which both deficiency and excess exposure tend to have adverse effects. These RP values can then be used to establish a HBGV or MoE. This may be done by applying different adjustment factors to the RP, which are related to, for example, sensitive subgroups. KEY POINTS Human dose–response modelling in food safety assessments requires a broad range of modelling tools in addition to BMD modelling. Modelling of U‐shaped relationships for nutrients (dual‐risk) and identification of thresholds for health effects or other biological responses are some examples of the methods frequently applied for human data. Human dose–response modelling in food safety assessments requires a broad range of modelling tools in addition to BMD modelling. Modelling of U‐shaped relationships for nutrients (dual‐risk) and identification of thresholds for health effects or other biological responses are some examples of the methods frequently applied for human data. Human dose–response modelling in food safety assessments requires a broad range of modelling tools in addition to BMD modelling. Modelling of U‐shaped relationships for nutrients (dual‐risk) and identification of thresholds for health effects or other biological responses are some examples of the methods frequently applied for human data. As discussed in Section  4.4.3.1 , the BMD modelling approach has been the preferred approach in EFSA for modelling data from controlled animal experiments (EFSA Scientific Committee,  2022 ). For such data, the background response should, in the absence of cross‐contamination, be well defined by the un‐exposed controls. Furthermore, as the laboratory animals are randomised into exposure groups, there should not be a problem of confounding variables varying between exposure groups. Lastly, with proper dose selection the whole sigmoidal dose–response curve from background to maximum response should be captured. However, in studies assigning only few animals confounding, such as by sex, may occur that needs to be accounted for. The maximum response may also not be accurately captured if the number of dose groups is small, or the highest dose is not sufficiently high. In principle, the current EFSA Guidance for BMD modelling would work well for human experimental data, but for obvious ethical reasons such data rarely exist. There are rare exceptions such as in EFSA's 2014 opinion on perchlorate (EFSA CONTAM Panel,  2014 ). Similar data availability may also occur in nutrition where experimental data on food supplements is frequently available (revealing side effects or unexpected adverse events). However, in most cases one can expect that available experimental data would not be compatible with multi‐dose RCT design that the current BMD guidance addresses. The reason being that prior to conducting such experiments (phase III), the safety and tolerability of a substance is usually tested in smaller phase 0, I or II trials and any sign of harm would not justify further experiments (see Section  4.1.2.1 ). These studies usually suffer from correlated observations (dose escalation trials)/or small dose groups which may introduce confounding (e.g. imbalance in sex, smoking, age, or other factors). However, the above‐mentioned challenges with experimental studies in humans could, perhaps, be more easily addressed compared to trying to model human observational data. That is, the modelling of human observational data needs special considerations and deviations from existing BMD guidance (EFSA Scientific Committee,  2022 ) for the following reasons: There are likely differences between exposure groups for several confounders, which need to be adjusted for. For quantile data, results are usually presented in terms of adjusted risk measures relative to a reference group, and the baseline risk level of 1 does not have a standard error associated with it. For continuous outcome data, there may be methodological challenges in establishing the baseline level at zero exposure With individual or grouped exposure levels, the baseline risk level is likely not zero, so the BMR would not be relative to zero exposure. There are likely differences between exposure groups for several confounders, which need to be adjusted for. For quantile data, results are usually presented in terms of adjusted risk measures relative to a reference group, and the baseline risk level of 1 does not have a standard error associated with it. For continuous outcome data, there may be methodological challenges in establishing the baseline level at zero exposure With individual or grouped exposure levels, the baseline risk level is likely not zero, so the BMR would not be relative to zero exposure. These considerations are not addressed in the 2022 guidance (EFSA Scientific Committee,  2022 ), but other research groups have published on these topics (Budtz‐Jørgensen et al.,  2001 ; Whitney & Ryan,  2013 ). The reference group in most observational studies with exposure grouping is the lowest exposure group, not a true zero exposure group. So, the BMR may need to be defined as the incremental increase relative to that of non‐zero exposure. This may vary between studies, affecting the calculated BMD. To bypass this, assumptions on background response at zero concentrations need to be made. In many cases, no health risk would be expected at very low exposure levels, and therefore, the non‐zero exposure group could be defined as a reference. If the exposure–response relationship is very flat/shallow at the lowest exposure group relative to higher exposure groups, the error introduced when estimating the BMD by using the lower exposure group (because the response at zero dose is not known) should be minimal. This can be checked by sensitivity analysis. Unadjusted observational epidemiological data can be modelled ignoring risk factors other than the exposure of interest, but such modelling is vulnerable to the presence of confounding, so this is not advised. Current BMD modelling platforms developed by EFSA, RIVM and US EPA do not allow inclusion of multiple covariates in the models nor direct modelling of relative risk. While not yielding adjusted BMD values, the EFSA platform allows some evaluation of categorical covariates (Hasselt,  2022 ). In this approach, the data are stratified by the covariate categories and BMDs calculated for each stratum. For multiple covariates, a single stratifying variable comprising all combinations of covariates (e.g. sex and age groups) would need to be created. In many cases, this will have the disadvantage of generating one or more strata with sparse numbers. Such analyses may potentially highlight the most sensitive stratum, which could then be selected as RP. This approach has been used to calculate BMDLs in simple subgrouping cases, e.g. sex‐specific BMDLs. In cases where the BMD results are the same across these strata, the overall BMD and BMDL can be calculated in the normal way for the whole data and be considered unconfounded. However, in the presence of confounding, this stratification approach does not allow the direct estimate of BMDLs adjusted for these confounders. It is frequently the case that several covariates need to be adjusted for. In such situations, covariate adjustment should ideally be made using multivariable regression. Furthermore, to fully implement such approaches and to comply with the legal constraints of sharing individual participant data, the software would need to be downloadable so full data could be modelled by data owners instead of relying on summary data. 45 Looking ahead, a solution would be to develop new versions of BMD software programs better suited to epidemiological data, which would allow adjustment for multiple covariates and modelling of RR in line with how human observational data are traditionally modelled. In the absence of such software, there are ways to use existing BMD software and still address the three concerns listed above. This has been done to varying degree in previous EFSA opinions primarily for continuous outcomes (see examples provided in Section  4.4.3.4.5 below). Modelling of risk ratios, however, had not been performed until EFSA's 2024 opinion on inorganic arsenic (iAs) (EFSA CONTAM Panel,  2024 ) where the following indirect approach to modelling adjusted relative risk was used: Since the current BMD approach is not designed to model relative risk estimates such as IRR, HR, or OR, it was necessary to transform the relative risks to natural numbers/integers. This was based on the approach used by JECFA (FAO and WHO, 2011 ). For cohort studies, the incidence rate or the cumulative incidence in the reference category was calculated. This incidence was multiplied by the adjusted risk estimate to obtain an adjusted incidence estimate, which was then used to calculate the “adjusted number of cases” (as integers). Having obtained the “adjusted number of cases” and the population size (provided in the papers or calculated from number of cases and incidence rates) in each exposure category, these data could be used as input into the EFSA BMD webtool. Since the current BMD approach is not designed to model relative risk estimates such as IRR, HR, or OR, it was necessary to transform the relative risks to natural numbers/integers. This was based on the approach used by JECFA (FAO and WHO, 2011 ). For cohort studies, the incidence rate or the cumulative incidence in the reference category was calculated. This incidence was multiplied by the adjusted risk estimate to obtain an adjusted incidence estimate, which was then used to calculate the “adjusted number of cases” (as integers). Having obtained the “adjusted number of cases” and the population size (provided in the papers or calculated from number of cases and incidence rates) in each exposure category, these data could be used as input into the EFSA BMD webtool. This approach has the advantage of being able to use the existing BMD modelling platform to covariate adjusted data. In some cases, such use of data may introduce errors, for example if the numbers of observed cases in exposure groups are small, as the rounding to integers may mean that the RRs are somewhat changed. While BMD modelling is well characterised for laboratory animal studies, it is novel for epidemiological data. As epidemiological data have the advantage of avoiding the uncertainty of animal to human extrapolation, there is a need for guidance on BMD modelling of human data so that it can be consistently used for risk characterisation in EFSA's assessments. Although developing such guidance is outside of the scope of this document, main principles and considerations for dose–response modelling using data from human observational studies are outlined below. These considerations represent both the current status and suggestions for further empirical and/or modelling studies. KEY POINTS Although many of the principles laid out in the EFSA guidance on BMD modelling for controlled animal experiments may also apply for human data, a direct application is rarely feasible due to differences in study designs and the nature of human data. This may equally apply to modelling of human experimental data as such data, relevant for chemical risk assessment, are often not compatible with multidose randomised controlled experiments. Although modelling of human data has been performed by EFSA on several occasions, those efforts have had to overcome limitations in existing software platform that are designed to model specific type of experimental data. Adjustment for multiple covariates and modelling of relative risks are key specific aspects that BMD modelling software have to address to allow for appropriate modelling of human observational data. Although many of the principles laid out in the EFSA guidance on BMD modelling for controlled animal experiments may also apply for human data, a direct application is rarely feasible due to differences in study designs and the nature of human data. This may equally apply to modelling of human experimental data as such data, relevant for chemical risk assessment, are often not compatible with multidose randomised controlled experiments. Although modelling of human data has been performed by EFSA on several occasions, those efforts have had to overcome limitations in existing software platform that are designed to model specific type of experimental data. Adjustment for multiple covariates and modelling of relative risks are key specific aspects that BMD modelling software have to address to allow for appropriate modelling of human observational data. Although many of the principles laid out in the EFSA guidance on BMD modelling for controlled animal experiments may also apply for human data, a direct application is rarely feasible due to differences in study designs and the nature of human data. This may equally apply to modelling of human experimental data as such data, relevant for chemical risk assessment, are often not compatible with multidose randomised controlled experiments. Although modelling of human data has been performed by EFSA on several occasions, those efforts have had to overcome limitations in existing software platform that are designed to model specific type of experimental data. Adjustment for multiple covariates and modelling of relative risks are key specific aspects that BMD modelling software have to address to allow for appropriate modelling of human observational data. For BMD modelling of an epidemiological study, there are certain minimum conditions. There need to be multiple exposure groups, including one group with little or no exposure. If data are not grouped, the exposure needs to cover a wide range, including little or no exposure. In many cases, there will be several outcomes of interest, and for each outcome, there will be multiple studies, each of which may provide a BMD. Before carrying out BMD modelling, the BMR needs to be defined (in terms of additional absolute risk or relative risk). The selection of the BMR from human data shares common principles with the selection of BMRs from laboratory animal data. 'The BMR is a degree of change that defines a level of response in a specific endpoint that is measurable, considered relevant to humans or to the model species, and that is used for estimating the associated dose (the "true" BMD)' (EFSA Scientific Committee,  2022 ). For continuous outcomes, the BMR is, ideally, the smallest measurable change that reflects adversity. In human studies, the link between some clinical biomarkers and disease endpoints is well established, and that could be considered when selecting a BMR. For quantal outcomes with relatively low absolute risk, the relative risk approach is more appropriate than modelling absolute risk (see examples in Section  4.2.1.1 ). Given the BMR is a direct estimate of effects in the underlying study population, further adjustment factors may be applied to the BMDL to establish a HBGV. No clear guidance on when such factors should be applied exist for human data, but some considerations and past examples are highlighted in Section  4.4.3.4.5 below. Epidemiology data which fit most readily into the BMD approach established for laboratory animal studies are quantal data from prospective cohorts with a common outcome (e.g. CVD). Ideally such studies should have sufficient follow‐up time. The data to be entered reflect the adjusted incremental risk per group and are based on the size of each exposure group and the number of cases in each group. However, the crude (unadjusted) observational data cannot be used as this is likely subject to confounding. Similar to the approach used by the CONTAM panel for inorganic arsenic described above (EFSA CONTAM Panel,  2024 ), the number of cases in each group above reference can be estimated from the adjusted relative risk, rounded to the nearest integer and entered into the BMD modelling platform. The error of rounding is relatively minor if groups are not too small. 46 The BMDL can then be calculated at the BMR. However, when studies with low incidence of the outcome are modelled, a BMR based on absolute low risk (say 1%) is usually not going to be reached. In such cases, modelling the relative increase in risk would make more sense. The same calculations would be done but the target BMR would be defined as a relative risk increase of, e.g. 5% relative to the reference group rate. 47 Another reason for focusing on relative risk is due to limited follow‐up in many studies. That is, the observed absolute risk during insufficient follow‐up may underestimate the lifetime risk given sufficient follow‐up, but the relative risk more appropriately captures the effect of exposure. Another concern when applying the BMD approach to human data is whether the 'low exposure' reference category in epidemiology can properly be considered as equivalent to the zero‐dose referent category in experiments. With individual or grouped exposure levels, the baseline risk level is likely not zero, so the BMR would not be relative to zero exposure. If there is a threshold exposure–response relationship and the lowest exposure category is below that threshold, then this category still provides a reasonable baseline equivalent to the risk level at a true zero exposure. If the baseline category is close to the general background population exposure level, then it can be assumed to reflect the real‐world contrast between additional exposure and unavoidable exposure. However, if the lowest exposure group is, on average, substantially exposed, then this is a weakness that needs to be acknowledged in the review of study‐specific BMDLs. 48 The BMD modelling approach with grouped exposure data and quantal outcome data suggested for cohort studies can be extended to case–control data but this requires some additional assumptions. If we assume that all cases are collected from a given base population, and the controls provide a sufficient estimate of the exposure distribution of the base population, then there is data on the total population in each exposure group. The numbers of cases above the reference population can be adjusted to conform with the odds ratio in a comparable manner to the preceding case for cohort data. The data can be processed to calculate BMDLs to a BMR by considering the target relative risk or odds ratio (EFSA CONTAM Panel,  2024 ). BMD approaches can also be applied to continuous outcomes (e.g. cholesterol, glucose or antibodies in serum, IQ score, lung function tests). The BMR should reflect a minimal change which is adverse and, therefore, it will depend on the nature of the endpoint selected. No default values exist and the BMR should be based on health considerations, but some examples based on previous EFSA assessments are given in next section. KEY POINTS Selection of benchmark response for human benchmark dose modelling follows the same principle as laid out in the EFSA guidance for benchmark dose modelling for controlled studies in experimental animals. Based on biological and modelling considerations, it needs to be assessed on a case‐by‐case basis whether 'low exposure' reference category in human observational studies can be considered as equivalent to the zero‐dose referent category in experiments. In many cases, the associated uncertainties are marginal and easily dealt with. Modelling of adjusted continuous outcomes where exposure is categorised is relatively straight forward with existing benchmark dose modelling software. For quantal health outcomes, modelling of adjusted incident data is more challenging with existing software, but the approach used for inorganic arsenic by EFSA is one example of how such data can be modelled based on several assumptions. Modelling of unadjusted observational data is strongly discouraged. The principles of human benchmark dose modelling should be used as starting point for developing guidance for human benchmark dose modelling. Such guidance may require some modification of the existing benchmark dose modelling framework to make it compatible with modelling of observational data. Selection of benchmark response for human benchmark dose modelling follows the same principle as laid out in the EFSA guidance for benchmark dose modelling for controlled studies in experimental animals. Based on biological and modelling considerations, it needs to be assessed on a case‐by‐case basis whether 'low exposure' reference category in human observational studies can be considered as equivalent to the zero‐dose referent category in experiments. In many cases, the associated uncertainties are marginal and easily dealt with. Modelling of adjusted continuous outcomes where exposure is categorised is relatively straight forward with existing benchmark dose modelling software. For quantal health outcomes, modelling of adjusted incident data is more challenging with existing software, but the approach used for inorganic arsenic by EFSA is one example of how such data can be modelled based on several assumptions. Modelling of unadjusted observational data is strongly discouraged. The principles of human benchmark dose modelling should be used as starting point for developing guidance for human benchmark dose modelling. Such guidance may require some modification of the existing benchmark dose modelling framework to make it compatible with modelling of observational data. Selection of benchmark response for human benchmark dose modelling follows the same principle as laid out in the EFSA guidance for benchmark dose modelling for controlled studies in experimental animals. Based on biological and modelling considerations, it needs to be assessed on a case‐by‐case basis whether 'low exposure' reference category in human observational studies can be considered as equivalent to the zero‐dose referent category in experiments. In many cases, the associated uncertainties are marginal and easily dealt with. Modelling of adjusted continuous outcomes where exposure is categorised is relatively straight forward with existing benchmark dose modelling software. For quantal health outcomes, modelling of adjusted incident data is more challenging with existing software, but the approach used for inorganic arsenic by EFSA is one example of how such data can be modelled based on several assumptions. Modelling of unadjusted observational data is strongly discouraged. The principles of human benchmark dose modelling should be used as starting point for developing guidance for human benchmark dose modelling. Such guidance may require some modification of the existing benchmark dose modelling framework to make it compatible with modelling of observational data. It would be desirable for future iterations of the current EFSA BMD software platform (Hasselt University,  2022 ) to allow for direct input of continuous exposure data (i.e. at individual level) that are not divided into subgroups. Such data may be in individual studies or may be combined by well‐established rules for meta‐analysis, leading to a pooled, more precise slope for the exposure–response. There has been some experience of using BMD modelling for continuous human data in EFSA assessments. Those assessments provide some examples on how decisions on BMRs vary depending on the outcome. For cadmium, urinary cadmium was related to the BMR of 5% of having beta‐2‐microglobulin (B2M) levels exceeding 300 μg/g creatinine (EFSA, 2009 ). In the EFSA opinion on perchlorate BMD modelling of thyroidal radioiodine uptake, a BMR of 5, 10 and 20% was based on modelling a human intervention study (dose escalation trial) (EFSA CONTAM Panel,  2014 ). For lead (Pb), a 1% BMR was selected for two continuous outcomes – decrease in IQ score and increase in systolic blood pressure corresponding to an absolute change of 1‐IQ point and 1.2 mmHg, respectively (EFSA CONTAM Panel,  2010 ). A 10% BMR for one quantal outcome, for the change in the prevalence of chronic kidney disease (CKD) was also quantified. For the opinion of perfluorooctane sulfonate (PFOS) and perfluorooctanoic acid (PFOA) in 2018, a BMR of 5% relative increase in serum cholesterol was used for both compounds; and in a later revised opinion on perfluoroalkyl substances (PFAS) from 2020, a BMR of 10% absolute reduction in antibody response (antibody titres found using serological tests) was used for the sum of four PFAS (EFSA CONTAM Panel,  2018 , 2020 ). For cadmium, urinary cadmium was related to the BMR of 5% of having beta‐2‐microglobulin (B2M) levels exceeding 300 μg/g creatinine (EFSA, 2009 ). In the EFSA opinion on perchlorate BMD modelling of thyroidal radioiodine uptake, a BMR of 5, 10 and 20% was based on modelling a human intervention study (dose escalation trial) (EFSA CONTAM Panel,  2014 ). For lead (Pb), a 1% BMR was selected for two continuous outcomes – decrease in IQ score and increase in systolic blood pressure corresponding to an absolute change of 1‐IQ point and 1.2 mmHg, respectively (EFSA CONTAM Panel,  2010 ). A 10% BMR for one quantal outcome, for the change in the prevalence of chronic kidney disease (CKD) was also quantified. For the opinion of perfluorooctane sulfonate (PFOS) and perfluorooctanoic acid (PFOA) in 2018, a BMR of 5% relative increase in serum cholesterol was used for both compounds; and in a later revised opinion on perfluoroalkyl substances (PFAS) from 2020, a BMR of 10% absolute reduction in antibody response (antibody titres found using serological tests) was used for the sum of four PFAS (EFSA CONTAM Panel,  2018 , 2020 ). Besides continuous data, there are also some examples of BMR being used for quantal outcomes: In the EFSA opinion on perchlorate BMD modelling of cutaneous effects (quantal outcome data) a BMR of 1% and 10% was used. This was done by merging data from three independent human interventions (EFSA CONTAM Panel,  2014 ). BMD modelling of human data based on quantal outcomes (e.g. cancer) was performed in the update of EFSA's risk assessment of inorganic arsenic in food (EFSA CONTAM Panel,  2024 ). The observed cumulative incidence (the ratio between the number of cases and the size of the source population over the observation time) in the assessed studies was estimated to be around 0.02%. Moreover, for the assessed cancer endpoints, a BMR of 1%–5%, expressed as relative increase of the background incidence after adjustment for confounders, was regarded to be relevant for public health. Thus, a BMR of 5% was used of 0.06 μg iAs/kg bw per day obtained from a study on skin cancer as a RP. In the EFSA opinion on perchlorate BMD modelling of cutaneous effects (quantal outcome data) a BMR of 1% and 10% was used. This was done by merging data from three independent human interventions (EFSA CONTAM Panel,  2014 ). BMD modelling of human data based on quantal outcomes (e.g. cancer) was performed in the update of EFSA's risk assessment of inorganic arsenic in food (EFSA CONTAM Panel,  2024 ). The observed cumulative incidence (the ratio between the number of cases and the size of the source population over the observation time) in the assessed studies was estimated to be around 0.02%. Moreover, for the assessed cancer endpoints, a BMR of 1%–5%, expressed as relative increase of the background incidence after adjustment for confounders, was regarded to be relevant for public health. Thus, a BMR of 5% was used of 0.06 μg iAs/kg bw per day obtained from a study on skin cancer as a RP. Modelling approaches other than the BMD have been applied when assessing dose–response in human observational data. These include dose–response meta‐analyses for establishing dietary reference values (DRV)/HBGV/UL for nutrients, modelling approaches to detect changes in risk or incidence (e.g. in cancer research); and assessing excess risk for chemical exposure in the occupational setting. Although there has been somewhat less focus on the use of these approaches in chemical risk assessment compared to the BMD, this may be more related to the frequent use of data from experimental animals for establishing HBGV rather than the appropriateness of these modelling approaches. Below, a short description of alternative methods, their suitability and pros and cons are provided. The choice of method depends both on the nature of the data and specific objective of the modelling. Therefore, in this document no general prescription on which method to use can be provided; the choice needs to be made on a case‐by‐case basis using expert judgement. A fundamental statistical tool that is playing a key role in risk assessment is the implementation of dose–response meta‐analysis, which is based on a flexible modelling framework that can incorporate epidemiological studies encompassing different levels and categories (2 or more) of exposure. Until recently, dose–response meta‐analyses have been based on forest plots, which have the limitation of ignoring the heterogeneity of exposure categories across studies and the shape of the dose–response relation (Vinceti et al.,  2020 ). Recently, the use of the so‐called one‐stage or mixed‐effects framework, allowing the synthesis of tables of empirical contrasts, has become common, allowing the estimation of heterogeneous and frequently curvilinear dose–response relations. This method is generally based on the described random‐effects dose–response restricted cubic spline modelling using a one‐stage mixed effect meta‐analytic model for aggregated data (Crippa et al.,  2018 ; Orsini et al.,  2012 ; Orsini & Spiegelman,  2021 ; Sera et al.,  2019 ; Vinceti et al.,  2020 ). In a meta‐analysis, parameter estimates are generally obtained with the restricted maximum likelihood method, and statistical inference typically focuses on the summary dose–response relation. Use of these methods to carry out dose–response meta‐analysis allows the identification and characterisation of complex relations between exposure and endpoints, such as chronic disease risk associated with intake of nutrients like potassium, sodium, manganese and selenium, and of contaminants like acrylamide and cadmium (Adani et al.,  2020 ; Filippini et al.,  2021 ; Filippini et al.,  2022 ; Vinceti et al.,  2016 ). In such cases, adverse health effects may arise at too low and/or too high exposure and characterising the shape of such patterns of association is of paramount value in risk assessment. In these instances, the use of linear functions, such as linear regression analysis, would likely lead to wrong statistical inferences and conclusions. In nutritional risk assessment, non‐linear dose–response meta‐analytic modelling is of key importance not only in shaping the exposure–disease relations, but also when characterising the relation between intake and biomarkers of exposure, or more generally quantitative variables. Conversely, when some studies do not report findings in a way that is suitable for dose–response meta‐analyses, their omission may reduce the body of evidence available for assessment. The availability of a large number of studies for the dose–response meta‐analysis is also very important not only to yield more precise risk/effect estimates, but also to: (1) broaden the range of exposure for which the risk assessment is conducted, (2) identify population characteristics that may act as effect modifiers in the exposure–endpoint relation, (3) carry out sensitivity analyses according to the RoB of the studies; (4) assess publication bias or small study effect. Of relevance are also methodological issues such as the general preference in using the most adjusted estimates from the studies to be included in the meta‐analyses, the selection of the RP in plotting the curves of RR/effect estimates in relation with the Y‐axis, the selection of the knots, i.e. the fixed points among which the curves are smoothly interpolated in the most commonly used flexible method to model non‐linear functions, i.e. the restricted cubic spline function (Crippa & Orsini,  2016 ; Orsini et al.,  2012 ). Dose–response meta‐analyses based on cubic spline modelling are largely used for both continuous (e.g. blood glucose, blood pressure) and dichotomous (disease occurrence) endpoints (Vinceti et al.,  2020 ). KEY POINTS Dose–response‐meta‐analysis is a more robust way of integrating data from several studies compared to traditional forest plots used on meta‐analyses that generally ignore the heterogeneity of exposure. In data rich cases, the use of dose–response meta‐analyses may also provide a more realistic and more comprehensive assessment of an underlying exposure health relationship than relying on the results from a single or a few studies. Dose–response‐meta‐analysis is a more robust way of integrating data from several studies compared to traditional forest plots used on meta‐analyses that generally ignore the heterogeneity of exposure. In data rich cases, the use of dose–response meta‐analyses may also provide a more realistic and more comprehensive assessment of an underlying exposure health relationship than relying on the results from a single or a few studies. Dose–response‐meta‐analysis is a more robust way of integrating data from several studies compared to traditional forest plots used on meta‐analyses that generally ignore the heterogeneity of exposure. In data rich cases, the use of dose–response meta‐analyses may also provide a more realistic and more comprehensive assessment of an underlying exposure health relationship than relying on the results from a single or a few studies. For certain disease outcomes, such as cancer or CVD, the primary interest when quantifying risk may not necessarily be assessment of the whole dose–response but rather to estimate at what exposure level a statistically significant change occurs in the slope for risk (or incidence). This could, for example, include assessing the level of exposure associated with significant changes in incidence or risk from background. Such modelling can both be done using individual participant data or summary data. Although this type of modelling has not been commonly applied in risk assessment within food safety, it has been extensively used in cancer research for monitoring changes in cancer incidence over time in different populations. The National Institute of Cancer, has for example, specifically developed a software, called JoinPoint (Kim et al.,  2000 ), to estimate changes in cancer incidence over time. The method is based on fitting a piecewise linear model to the dose–response data. The presence of change‐points are then assessed by fitting where on the dose–response curve the slope of the linear segment changes. The resulting change‐point is estimated along with associated uncertainty (based on the standard error). Although this method has primarily been used to assess changes in cancer incidence over time, this methodology is increasingly being applied to model epidemiological studies, such as assessing the level of compliance needed for antihypertensive medication to significantly reduce later risk of CVD (Yang et al.,  2017 ); or assessing where selenoprotein P concentrations plateau in relation to blood selenium (Hurst et al.,  2010 ). In addition to the JoinPoint package, other software is available. The package 'Segmented' in R 49 allows for the estimation of change‐points through regression analyses where adjustment for covariates can also be performed. This package in R was, for example, used in a recent publication to assess the change‐point (Figure  7A ) at which antibody titres started to decrease significantly with higher serum PFAS concentrations in 1‐year old children (Source: Abraham et al.,  2020 ). The 'Segmented' package in R can also be used for quantile data. Although no example within the area of food safety is available, the use of this method is well illustrated in the modelling of the relationship between maternal age and Down syndrome (Figure  7B ); Source: Muggeo,  2008 (from Davidson & Hinkley, 1997)). Examples of applications of segmented regression. (A) The association between serum plasma concentrations of PFOA in 1‐year‐old children in relation to adjusted antibody concentrations to tetanus (Source: Abraham et al.,  2020 ). The red line is the fitted piece‐wise linear function while the broader grey line is a moving average. The vertical grey line shows the PFOA concentration where the change‐point is identified (16.9 ng/mL). (B) How a piece‐wise linear function is fitted through a data for the association between maternal age and percentage of children being born with Down syndrome (Source: Muggeo,  2008 (from Davidson & Hinkley, 1997)). The red line identifies a change‐point at 31.1 y (Std error of 0.7 y). However, as the first segment of the line was not significantly different from null the fitted green line was fitted with the constraint the first slope being β 1  = 0. In that case the change‐point was estimated as 31.5 y (Std. error of 0.6 y). Primarily biological considerations and the nature of the endpoint should be guiding the selection of change‐points. Assessment of change‐points may be relevant when the underlying dose–response data in the study of interest appears to show a threshold for effect. The uncertainty around the change‐point can be quantified objectively, based on the standard errors. This modelling approach may be of relevance for certain disease endpoints such as cancer and CVD, where assessment of when a significant change in risk (or incidence) from background occurs may be of more interest than assessing where a relative or absolute BMR occurs. The identified change‐point can then be used for further risk characterisation. KEY POINTS The use of statistical methods assesses the presence of change‐points, where a significant change in risk or response occurs relative to baseline is one option for characterising risk. Although use of such methods has been traditionally confined to cancer research, this methodology is increasingly being used for other health related outcomes. Available software packages make the use of this methodology relatively straight forward and allow for quantification of uncertainty. The use of statistical methods assesses the presence of change‐points, where a significant change in risk or response occurs relative to baseline is one option for characterising risk. Although use of such methods has been traditionally confined to cancer research, this methodology is increasingly being used for other health related outcomes. Available software packages make the use of this methodology relatively straight forward and allow for quantification of uncertainty. The use of statistical methods assesses the presence of change‐points, where a significant change in risk or response occurs relative to baseline is one option for characterising risk. Although use of such methods has been traditionally confined to cancer research, this methodology is increasingly being used for other health related outcomes. Available software packages make the use of this methodology relatively straight forward and allow for quantification of uncertainty. Other modelling approaches have, compared to the BMD approach, been applied when assessing dose–response in human observational data. These include dose–response meta‐analyses discussed in the previous section, and various approaches to establish the shape of the exposure–response relationship including simple linear relationships. The most common approach has been to estimate the slope of the dose–response curve (in terms of a disease rate, risk, hazards ratio or odds, usually log transformed) with an extrapolation to zero exposure. Based on that extrapolation, the exposure equivalent to an excess risk of say 1/1000 or 1/10,000 can be estimated (Steenland & Deddens,  2004 ). This approach can be extended to continuous variables. Splines or linear functions are commonly used for this purpose, but other models can be applied as well. The RP would be identified based on the excess risk which forms the basis for establishing the HBGV with or without the use of an uncertainty factor (UF) (see Section  4.4.3 ). Different approaches and conventions exist for choosing a function for such modelling. Today the use of splines is common and well accepted, frequently in the form of restricted cubic splines. The use of splines is well suited to capture any deviation from non‐linearity and generally provides better flexibility than polynomials. The use of linear models is also common in epidemiology and may be justified in cases where strong indication of linearity exists, depending on whether or not variables are on a log or linear scale. Approximate linear relationships may occur in human epidemiology when the exposure range is small (narrow). That is, the full non‐linearity of a relationship is not captured. Furthermore, if variability of the exposure and/or the outcome is high, as it is often the case in human studies, it may be difficult to distinguish between a linear and non‐linear response. Often the assumption on linearity beyond the observed exposure range is simpler and easier to interpret than extrapolation based on non‐linear functions. One way to justify the use of linear functions is to test if a significantly better fit is obtained (e.g. based on the residual sum of squares) when using a non‐linear function such as splines. If the non‐linear function does not provide a significantly better fit, the use of a linear function is partly justified. KEY POINTS Modelling of risk using simple linear or non‐linear function to quantify the excess risk relative to a reference category or by extrapolating to zero exposure is one alternative to benchmark dose modelling. Such an approach may be considered when conditions for benchmark dose modelling are, for varying reasons, not met, or when a simpler modelling approach is considered more appropriate. The corresponding reference point based on the excess risk would form the basis of the HBGV with or without the use of an UF. Modelling of risk using simple linear or non‐linear function to quantify the excess risk relative to a reference category or by extrapolating to zero exposure is one alternative to benchmark dose modelling. Such an approach may be considered when conditions for benchmark dose modelling are, for varying reasons, not met, or when a simpler modelling approach is considered more appropriate. The corresponding reference point based on the excess risk would form the basis of the HBGV with or without the use of an UF. Modelling of risk using simple linear or non‐linear function to quantify the excess risk relative to a reference category or by extrapolating to zero exposure is one alternative to benchmark dose modelling. Such an approach may be considered when conditions for benchmark dose modelling are, for varying reasons, not met, or when a simpler modelling approach is considered more appropriate. The corresponding reference point based on the excess risk would form the basis of the HBGV with or without the use of an UF. The situations in which use of uncertainty factors is considered relate primarily to the uncertainty around the identified RP for establishing HBGVs and the level of protection that is aimed at. The uncertainty can for example relate to the external validity of the study used to identify a RP. For example, a RP identified in a population of healthy adults may not necessarily be protective in the case of more vulnerable population subgroups, such as pregnant women or the elderly. Similarly, the use of uncertainty factors may be appropriate when there is limited information on dose–response. This is quite common in the area of nutrition, where adverse effects are observed as a result of high intake in an observational setting, or where adverse effects are observed in food supplementation trials. In these situations, it is often not possible to identify the specific intake level at which adversity occurs, and limited information exists on the level of risk associated with slightly lower intake. In such cases, uncertainty factors have often been applied (EFSA,  2005 ). As no specific guidance exists on applying UFs when the RP is identified from human data besides the application of a 10‐fold UF for accounting for human variability, case‐by‐case assessments relying on expert judgement need to be made. If available, an a priori standard UF may be a starting point for the process of selecting a final UF in the risk assessment. However, such UFs should always be assessed, tailored to the real data and overall evidence available, and eventually adapted based on expert judgement. Factors to consider for applying or not an UF could, for example, be: whether the pivotal studies appropriately represent the general population or the population relevant for the risk assessment, to account for uncertainties of the methodology used for assessing the exposure or the health outcome, whether the health outcome investigated is a primary or surrogate measure for the health outcome, whether physiological requirements need to be considered, such as for establishing upper levels for nutrients, uncertainties associated with deriving intake from biomarkers of exposure. whether the pivotal studies appropriately represent the general population or the population relevant for the risk assessment, to account for uncertainties of the methodology used for assessing the exposure or the health outcome, whether the health outcome investigated is a primary or surrogate measure for the health outcome, whether physiological requirements need to be considered, such as for establishing upper levels for nutrients, uncertainties associated with deriving intake from biomarkers of exposure. A few examples on how these considerations have been applied previously are provided below. In the re‐evaluation of the existing HBGVs for copper (EFSA Scientific Committee,  2023b ), the established HBGV was based on retention of copper (a predictor of future toxicity), on the basis of measurements of excretion in healthy individuals. Although the pivotal study was conducted in healthy individuals that may not represent the general population, the fact that copper retention was a surrogate measure for the health outcome (liver toxicity), the RP identified was considered sufficiently protective for most consumers over long‐term. Use of UF was, therefore, not considered necessary. In case of selenium, an UF of 1.3 was applied to account for the uncertainties associated with extrapolating findings from a large RCT carried out in the US to the general risk assessment for the European population. The scientific justification for the UF of 1.3 was based on expert judgement taking into account various considerations, among them the uncertainty around the dose–response due to the use of single‐dose trials and current dietary intakes in the EU. These are explained in detail in the Scientific Opinion (EFSA NDA Panel,  2023 ). An example of adjusting for uncertainty due to the use of a biomarker of exposure in establishing a HBGV, can be found in the CONTAM Panel assessment of mercury (EFSA CONTAM Panel,  2012 ). The exposure estimation was based on the mercury concentration in hair samples from mothers and converted to maternal blood concentrations. Here a data‐driven chemical specific adjustment factor of 2 was applied, in addition to the standard factor for interindividual toxicokinetic variability of 3.2., to adjust for variability in the hair to blood mercury ratio. The resulting UF was thus 6.4. KEY POINTS The recommended approach for application of uncertainty factors for deriving health‐based guidance values should be based on the case at hand, tailored to the available data and based on the uncertainty analysis of the data. Currently, no specific guidance exists for using uncertainty factors when risk characterisation is based on human studies. The recommended approach for application of uncertainty factors for deriving health‐based guidance values should be based on the case at hand, tailored to the available data and based on the uncertainty analysis of the data. Currently, no specific guidance exists for using uncertainty factors when risk characterisation is based on human studies. The recommended approach for application of uncertainty factors for deriving health‐based guidance values should be based on the case at hand, tailored to the available data and based on the uncertainty analysis of the data. Currently, no specific guidance exists for using uncertainty factors when risk characterisation is based on human studies. Recommendations for EFSA risk assessments Evidence from epidemiological studies in humans should be used in risk assessments to the extent possible. The overall assessment should consider the entire body of evidence. Judgements on the overall body of evidence should always be made by considering the type, and, if possible, direction and magnitude of potential biases identified across different studies, for example by using a triangulation approach. To facilitate more structured and time efficient risk assessment, the use of evidence maps and scoping reviews during the planning phase of an Opinion is recommended. RoB tools provide a structured way to identify different biases that may occur to varying degrees in different studies. The key elements to capture within each study are the source, magnitude and direction of possible biases. That complexity cannot be accurately captured by assigning a numerical score of study quality, which is therefore discouraged. The type of dose–response modelling for risk or benefit characterisation should be selected based on the type and nature of the available data and the objective aimed for (minimising risk, maximising benefits or balancing the two). Evidence from epidemiological studies in humans should be used in risk assessments to the extent possible. The overall assessment should consider the entire body of evidence. Judgements on the overall body of evidence should always be made by considering the type, and, if possible, direction and magnitude of potential biases identified across different studies, for example by using a triangulation approach. To facilitate more structured and time efficient risk assessment, the use of evidence maps and scoping reviews during the planning phase of an Opinion is recommended. RoB tools provide a structured way to identify different biases that may occur to varying degrees in different studies. The key elements to capture within each study are the source, magnitude and direction of possible biases. That complexity cannot be accurately captured by assigning a numerical score of study quality, which is therefore discouraged. The type of dose–response modelling for risk or benefit characterisation should be selected based on the type and nature of the available data and the objective aimed for (minimising risk, maximising benefits or balancing the two). Recommendations for further developments 7 RoB tools have a long history of use for RCTs in humans. There is room for further development of these tools to capture the differences of different observational designs and use for other populations (e.g. livestock or companion animals and plants). It is recommended that EFSA collaborates at the European and international levels with relevant organisations and initiatives to harmonise developments in this area. 8 Based on the principles outlined in this document, a guidance on human BMD modelling specifically addressed to modelling of human observational data should be developed. This requires adaptation to the existing methodological framework designed around controlled animal experiments. It would need to be accompanied by changes to existing BMD software platforms to allow for adjustments of multiple covariates and modelling of relative risk. This would allow for more rigorous and consistent use of human data. 9 The use of multivariable regression analysis is recommended to account for covariates/confounders for BMD modelling of data from epidemiological studies. 10 Efforts are needed to provide guidance on the use of UFs and in particular on the MoE approach when using human epidemiological data. 11 The use of other modelling approaches frequently applied in epidemiology such as dose–response meta‐analyses or estimation of thresholds should be explored and developed further for the area of chemical risk assessment. 12 Although design and conduct of epidemiological studies in humans, animals and plants often differ, many similarities exist. Better understanding of those similarities and differences and the terminology used is essential to address cross‐cutting challenges that EFSA will face in the future. This requires training and closer collaboration among experts and staff across panels. GLOSSARY Accuracy The extent to which systematic error (bias) is minimised. Risk of bias addresses also aspects like the sensitivity and specificity of the detection method used in an assessment (also referred to as 'Internal Validity'). Aggregated data Information resulting from the combination of individual data (e.g. mean exposure in a treatment group, standard deviation of the observations in a group, etc.). See Individual data. Assembling the Evidence The first of three basic steps of weight of evidence assessment, as proposed in this guidance. Includes identification of potentially relevant evidence, selection of evidence to include in the weight of evidence assessment and grouping the evidence into lines of evidence. Assessment The term refers to all types of scientific assessments produced in the EFSA context, and for referring to both assessments based on data generated ex novo, assessments based on already existing data or assessments conducted by eliciting expert knowledge. Also referred to as 'scientific assessment'. Best professional judgement A category of weight of evidence assessment methods involving qualitative listing and qualitative integration of multiple pieces or lines of evidence. Case‐specific assessment Case‐specific assessments, where there is no pre‐specified procedure and assessors need to choose and apply weight of evidence approaches on a case‐by‐case basis. Causal criteria A category of weight of evidence assessment methods based on criteria for determining cause and effect relationships. Complementary line of evidence A line of evidence which can only answer a question or sub‐question when it is combined with other line(s) of evidence. Conceptual framework The context of the assessment; all sub‐question(s) that must be answered; and how they combine in the overall assessment. Consistency The extent to which the contributions of different pieces or lines of evidence to answering the specified question are compatible. Critical Appraisal Tool (CAT) Tool for appraising study methodological quality (see definition). A CAT contains a comprehensive list of elements to consider for appraising study methodological quality and detailed guidance for performing the appraisal. CATs are tailored for the specific study designs. For instance, the items to be considered when appraising a randomised controlled trial are different from those considered in an observational study. Within the same study design CATs should be applied by outcome or endpoint. This is because the same study can be of different methodological quality depending on the outcomes that are reported. CATs should be applied to each individual study included in the assessment so to allow a consistent classification of studies according to their methodological quality (which is then considered when assessing the reliability of the evidence they provide). Data A piece of information. See also Individual data and Aggregated data. Ecological studies Studies in which the unit of analysis are populations or groups of people rather than individuals. Conclusions of ecological studies may not apply to individuals, but ecological studies can reach valid inferences on causal relationships at the aggregate/ group (ecological) level. Ecological studies have a role when implementing or evaluating policies that affect entire groups or regions. Emergency assessment Emergency procedures, where the choice of approach is constrained by unusually severe limitations on time and resources. Estimate A calculation or judgement of the approximate value, number, quantity, or extent of something. Some weight of evidence questions refer to estimates, while others refer to hypotheses. Evidence Information that is relevant for assessing the answer to a specified question. In PROMETHEUS, a piece of evidence for an assessment is defined as data (information) that is deemed relevant for the specific objectives of the assessment (EFSA,  2015 ). In this Guidance, this is expanded to all potentially relevant information, i.e. all evidence identified by the initial search process, to recognise that the assessment of relevance in the search process is necessarily a preliminary one (e.g. based on keywords and titles alone). ‘Evidence’ can refer to a single piece of potentially relevant information or to multiple pieces. Ex novo data generation The process of generating new data as it occurs when designing and conducting an experiment or an observational study (e.g. a survey). Sometimes also referred to as 'primary research study' as opposite to a 'secondary research study' based on existing data (i.e. a review). In the EFSA context, studies generating data ex novo are designed and conducted for instance by the applicants submitting a dossier to EFSA in support of an application or by EFSA, when e.g. performing surveys (e.g. baseline surveys). Expert judgement An expert judgement is a judgement made by an expert about a question or consideration in the domain in which they are expert. Such judgements may be qualitative or quantitative, but should always be careful, reasoned, evidence‐based and transparently documented. Extensive Literature Search (ELS) A literature search process structured in a way to identify as many studies relevant to a review question as needed. It is tailored in order to address the trade‐off between sensitivity and specificity depending on the context of the review question. The fundamental characteristics of an ELS are: (1) use of tailored search strings, and (2) extensive use of literature sources (i.e. bibliographic databases and other sources accessed via electronic or hand‐searching – for example, websites, journal tables of content, theses repositories, etc.). External Validity The validity of the inferences as they pertain to participants outside the source population which is either a target or can be argued to experience effects similar to the targets. GRADE An approach for grading the quality of evidence and the strength of recommendations in environmental and occupational health, proposed and developed by the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) Working Group (see Morgan et al.,  2016 ). Hypothesis One type of framing for weight of evidence questions. Defined as a proposition proposed to be a potential explanation of a phenomenon or a potential outcome of a phenomenon. Some weight of evidence questions refer to hypotheses, while others refer to estimates. Individual data Information collected at the level of the finest unit on which variables are measured (e.g. exposure observed on each individual belonging to a study). By definition, they cannot be further 'disaggregated'. Influence analysis A study of possible change in the assessment output resulting not just from uncertainties about inputs to the assessment but also from uncertainties about choices made in the assessment. Integrating the evidence The third of three basic steps of weight of evidence assessment, as proposed in this guidance. Includes developing a conceptual model for integration, assessing the consistency of the evidence, applying the method chosen for integration and developing the weight of evidence conclusion. Internal Validity See accuracy. Line of evidence A set of evidence of similar type Meta‐analysis A statistical analysis that combines the results of multiple scientific studies OHAT An approach to systematic review and evidence integration for literature based environmental health science assessments, developed by the NTP Office of Health Assessment and Translation (OHAT) (see Rooney et al.,  2014 ). Piece of evidence A broad term used to refer to distinct elements of evidence that may be combined to form a line of evidence, e.g. a single study, expert judgement or experience, a model, or even a single observation. Precision The extent to which random error is minimised and the outcome of the approach, method, process or assessment is reproducible over time. Probability Defined depending on philosophical perspective: (1) the frequency with which samples arise within a specified range or for a specified category; (2) quantification of uncertainty as degree of belief regarding the likelihood of a particular range or category. The latter perspective is implied when probability is used in a weight of evidence assessment to express relative support for possible answers Problem formulation In the present guidance, problem formulation refers to the process of clarifying the questions posed by the Terms of Reference, deciding whether and how to subdivide them, and deciding whether they require weight of evidence assessment Qualitative assessment An assessment performed or expressed using words, categories or labels Quantification A category of weight of evidence assessment methods defined as comprising formal decision analysis and statistical methods. Would also include probabilistic reasoning. Quantitative assessment An assessment performed or expressed using a numerical scale (see Rating A category of weight of evidence assessment methods for weighing and/or integration of evidence based on qualitative logic models, ranks, scores and empirical models. Refinement One or more changes to an initial assessment, made with the aim of reducing uncertainty in the answer to a question. Sometimes done as part of a ‘tiered approach’ to risk or benefit assessment. Relative support An expression of the extent to which evidence supports one possible answer to a weight of evidence question, relative to other possible answers. Can be expressed qualitatively or quantitatively. Quantitative expression can be in terms of probability Relevance The contribution a piece or line of evidence would make to answer a specified question, if the information comprising the line of evidence was fully reliable. In other words, how close is the quantity, characteristic or event that the evidence represents to the quantity, characteristic or event that is required in the assessment. This includes biological relevance as well as relevance based on other considerations, e.g. temporal, spatial, chemical, etc. Reliability Reliability of a piece of evidence refers to: (i) precision (see definition); and (ii) accuracy (see definition). It is influenced by the methodological quality of the process for producing such evidence. Representativeness Ability of a subset of a population (e.g. a sample of individuals) to reflect accurately specific characteristics of the population of origin. Scientific assessment See Assessment. Scope of the assessment What is to be evaluated in the assessment. Sensitivity analysis A study of how the variation in the outputs of a model can be attributed to, qualitatively or quantitatively, different sources of uncertainty or variability. Implemented by observing how model output changes when model inputs are changed in a structured way. Standalone line of evidence A line of evidence which offers an answer to a question or sub‐question without needing to be combined with other lines of evidence. Standardised assessment procedures Assessments where the approach to integrating evidence is fully specified in a standardised assessment procedure. They generally include standardised elements that are assumed to provide adequate cover for uncertainty. Sub‐question A scientific question that does not need to be further broken down to be answered and is formulated in a way that is directly answerable in an experiment or observational study (or as a single question in an expert elicitation study). Uncertainty A general term referring to all types of limitations in available knowledge that affect the range and probability of possible answers to an assessment question. Uncertainty analysis A collective term for the processes used to identify, characterise, explain and account for sources of uncertainty. Variability Heterogeneity of values over time, space or different members of a population, including stochastic variability and controllable variability. Weighing In this guidance, weighing refers to the process of assessing the contribution of evidence to answering a weight of evidence question. The basic considerations to be weighed are identified in this guidance as reliability, relevance and consistency of the evidence. Weighing the evidence The second of three basic steps of weight of evidence assessment, as proposed in this guidance. Includes deciding what considerations are relevant for weighing the evidence, deciding on the methods to be used, and applying those methods to weigh the evidence. Weight of evidence The extent to which evidence supports one or more possible answers to a scientific question. Hence ‘weight of evidence methods’ and ‘weight of evidence approach’ refer to ways of assessing relative support for possible answers. Weight of Evidence A function of relevance and reliability. Weight of evidence assessment A process in which evidence is integrated to determine the relative support for possible answers to a scientific question. Weight of evidence conclusion The outcome of a weight of evidence assessment, expressed in terms of relative support for possible answers to the weight of evidence question. Weight of evidence question A question addressed by a weight of evidence assessment. This may be the overall scientific question for an assessment, or a sub‐question that contributes to answering the overall question. Weight of evidence questions may be framed in terms of hypotheses (which are often qualitative) or estimates (quantitative). ABBREVIATIONS ADME absorption, distribution, metabolism and excretion AOP adverse outcome pathway APRIO Agent, Pathway, Receptor, Intervention and Output BMD benchmark dose BMDL benchmark dose lower confidence limit BMI body mass index BMR benchmark response BPA bisphenol A CASP Critical Appraisal Skills Programme CHD coronary heart disease CI confidence interval CVD cardiovascular disease DAGs directed acyclic graphs DRV dietary reference value EKE Expert Knowledge Elicitation EPIQ Evidence‐based Practice for Improving Quality GRADE Grading of Recommendations, Assessment, Development, and Evaluations HBCDD hexabromocyclododecane HBGV health‐based guidance value HDL high‐density lipoprotein HOC health outcome category HR hazard ratio IATA Integrated Approach to Testing and Assessment ICD International Statistical Classification of Diseases and Related Health Problems IPM Integrated Pest Management IQ Intelligence Quotient IRR incidence rate ratio ISPM International Standards for Phytosanitary Measures LoE line of evidence MIE molecular initiating events MoA mode of action MOE margin of exposure NOAEL no observed adverse effect level OR odds ratio PBPK physiologically based pharmacokinetic PFAS perfluoroalkyl substances PFOS perfluorooctane sulfonate PFOA perfluorooctanoic acid PECO Population, Exposure, Comparator, Outcome PICO Population, Intervention, Comparator, Outcome PIT Population, Index Test, Target Condition PLH Plant Health PO Population, Outcome QPRA Quantitative Pest Risk Assessments RCT randomised controlled trials RD risk difference RoB risk of bias RP reference point RR risk ratio SD standard deviation SYRCLE Systematic Review Center for Laboratory animal Experimentation ToR Terms of Reference WoE Weight of Evidence UF uncertainty factor RoB tools have a long history of use for RCTs in humans. There is room for further development of these tools to capture the differences of different observational designs and use for other populations (e.g. livestock or companion animals and plants). It is recommended that EFSA collaborates at the European and international levels with relevant organisations and initiatives to harmonise developments in this area. Based on the principles outlined in this document, a guidance on human BMD modelling specifically addressed to modelling of human observational data should be developed. This requires adaptation to the existing methodological framework designed around controlled animal experiments. It would need to be accompanied by changes to existing BMD software platforms to allow for adjustments of multiple covariates and modelling of relative risk. This would allow for more rigorous and consistent use of human data. The use of multivariable regression analysis is recommended to account for covariates/confounders for BMD modelling of data from epidemiological studies. Efforts are needed to provide guidance on the use of UFs and in particular on the MoE approach when using human epidemiological data. The use of other modelling approaches frequently applied in epidemiology such as dose–response meta‐analyses or estimation of thresholds should be explored and developed further for the area of chemical risk assessment. Although design and conduct of epidemiological studies in humans, animals and plants often differ, many similarities exist. Better understanding of those similarities and differences and the terminology used is essential to address cross‐cutting challenges that EFSA will face in the future. This requires training and closer collaboration among experts and staff across panels. The extent to which systematic error (bias) is minimised. Risk of bias addresses also aspects like the sensitivity and specificity of the detection method used in an assessment (also referred to as 'Internal Validity'). Information resulting from the combination of individual data (e.g. mean exposure in a treatment group, standard deviation of the observations in a group, etc.). See Individual data. The first of three basic steps of weight of evidence assessment, as proposed in this guidance. Includes identification of potentially relevant evidence, selection of evidence to include in the weight of evidence assessment and grouping the evidence into lines of evidence. The term refers to all types of scientific assessments produced in the EFSA context, and for referring to both assessments based on data generated ex novo, assessments based on already existing data or assessments conducted by eliciting expert knowledge. Also referred to as 'scientific assessment'. A category of weight of evidence assessment methods involving qualitative listing and qualitative integration of multiple pieces or lines of evidence. Case‐specific assessments, where there is no pre‐specified procedure and assessors need to choose and apply weight of evidence approaches on a case‐by‐case basis. A category of weight of evidence assessment methods based on criteria for determining cause and effect relationships. A line of evidence which can only answer a question or sub‐question when it is combined with other line(s) of evidence. The context of the assessment; all sub‐question(s) that must be answered; and how they combine in the overall assessment. The extent to which the contributions of different pieces or lines of evidence to answering the specified question are compatible. Tool for appraising study methodological quality (see definition). A CAT contains a comprehensive list of elements to consider for appraising study methodological quality and detailed guidance for performing the appraisal. CATs are tailored for the specific study designs. For instance, the items to be considered when appraising a randomised controlled trial are different from those considered in an observational study. Within the same study design CATs should be applied by outcome or endpoint. This is because the same study can be of different methodological quality depending on the outcomes that are reported. CATs should be applied to each individual study included in the assessment so to allow a consistent classification of studies according to their methodological quality (which is then considered when assessing the reliability of the evidence they provide). A piece of information. See also Individual data and Aggregated data. Studies in which the unit of analysis are populations or groups of people rather than individuals. Conclusions of ecological studies may not apply to individuals, but ecological studies can reach valid inferences on causal relationships at the aggregate/ group (ecological) level. Ecological studies have a role when implementing or evaluating policies that affect entire groups or regions. Emergency procedures, where the choice of approach is constrained by unusually severe limitations on time and resources. A calculation or judgement of the approximate value, number, quantity, or extent of something. Some weight of evidence questions refer to estimates, while others refer to hypotheses. Information that is relevant for assessing the answer to a specified question. In PROMETHEUS, a piece of evidence for an assessment is defined as data (information) that is deemed relevant for the specific objectives of the assessment (EFSA,  2015 ). In this Guidance, this is expanded to all potentially relevant information, i.e. all evidence identified by the initial search process, to recognise that the assessment of relevance in the search process is necessarily a preliminary one (e.g. based on keywords and titles alone). ‘Evidence’ can refer to a single piece of potentially relevant information or to multiple pieces. The process of generating new data as it occurs when designing and conducting an experiment or an observational study (e.g. a survey). Sometimes also referred to as 'primary research study' as opposite to a 'secondary research study' based on existing data (i.e. a review). In the EFSA context, studies generating data ex novo are designed and conducted for instance by the applicants submitting a dossier to EFSA in support of an application or by EFSA, when e.g. performing surveys (e.g. baseline surveys). An expert judgement is a judgement made by an expert about a question or consideration in the domain in which they are expert. Such judgements may be qualitative or quantitative, but should always be careful, reasoned, evidence‐based and transparently documented. A literature search process structured in a way to identify as many studies relevant to a review question as needed. It is tailored in order to address the trade‐off between sensitivity and specificity depending on the context of the review question. The fundamental characteristics of an ELS are: (1) use of tailored search strings, and (2) extensive use of literature sources (i.e. bibliographic databases and other sources accessed via electronic or hand‐searching – for example, websites, journal tables of content, theses repositories, etc.). The validity of the inferences as they pertain to participants outside the source population which is either a target or can be argued to experience effects similar to the targets. An approach for grading the quality of evidence and the strength of recommendations in environmental and occupational health, proposed and developed by the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) Working Group (see Morgan et al.,  2016 ). One type of framing for weight of evidence questions. Defined as a proposition proposed to be a potential explanation of a phenomenon or a potential outcome of a phenomenon. Some weight of evidence questions refer to hypotheses, while others refer to estimates. Information collected at the level of the finest unit on which variables are measured (e.g. exposure observed on each individual belonging to a study). By definition, they cannot be further 'disaggregated'. A study of possible change in the assessment output resulting not just from uncertainties about inputs to the assessment but also from uncertainties about choices made in the assessment. The third of three basic steps of weight of evidence assessment, as proposed in this guidance. Includes developing a conceptual model for integration, assessing the consistency of the evidence, applying the method chosen for integration and developing the weight of evidence conclusion. See accuracy. A set of evidence of similar type A statistical analysis that combines the results of multiple scientific studies An approach to systematic review and evidence integration for literature based environmental health science assessments, developed by the NTP Office of Health Assessment and Translation (OHAT) (see Rooney et al.,  2014 ). A broad term used to refer to distinct elements of evidence that may be combined to form a line of evidence, e.g. a single study, expert judgement or experience, a model, or even a single observation. The extent to which random error is minimised and the outcome of the approach, method, process or assessment is reproducible over time. Defined depending on philosophical perspective: (1) the frequency with which samples arise within a specified range or for a specified category; (2) quantification of uncertainty as degree of belief regarding the likelihood of a particular range or category. The latter perspective is implied when probability is used in a weight of evidence assessment to express relative support for possible answers In the present guidance, problem formulation refers to the process of clarifying the questions posed by the Terms of Reference, deciding whether and how to subdivide them, and deciding whether they require weight of evidence assessment An assessment performed or expressed using words, categories or labels A category of weight of evidence assessment methods defined as comprising formal decision analysis and statistical methods. Would also include probabilistic reasoning. An assessment performed or expressed using a numerical scale (see A category of weight of evidence assessment methods for weighing and/or integration of evidence based on qualitative logic models, ranks, scores and empirical models. One or more changes to an initial assessment, made with the aim of reducing uncertainty in the answer to a question. Sometimes done as part of a ‘tiered approach’ to risk or benefit assessment. An expression of the extent to which evidence supports one possible answer to a weight of evidence question, relative to other possible answers. Can be expressed qualitatively or quantitatively. Quantitative expression can be in terms of probability The contribution a piece or line of evidence would make to answer a specified question, if the information comprising the line of evidence was fully reliable. In other words, how close is the quantity, characteristic or event that the evidence represents to the quantity, characteristic or event that is required in the assessment. This includes biological relevance as well as relevance based on other considerations, e.g. temporal, spatial, chemical, etc. Reliability of a piece of evidence refers to: (i) precision (see definition); and (ii) accuracy (see definition). It is influenced by the methodological quality of the process for producing such evidence. Ability of a subset of a population (e.g. a sample of individuals) to reflect accurately specific characteristics of the population of origin. See Assessment. What is to be evaluated in the assessment. A study of how the variation in the outputs of a model can be attributed to, qualitatively or quantitatively, different sources of uncertainty or variability. Implemented by observing how model output changes when model inputs are changed in a structured way. A line of evidence which offers an answer to a question or sub‐question without needing to be combined with other lines of evidence. Assessments where the approach to integrating evidence is fully specified in a standardised assessment procedure. They generally include standardised elements that are assumed to provide adequate cover for uncertainty. A scientific question that does not need to be further broken down to be answered and is formulated in a way that is directly answerable in an experiment or observational study (or as a single question in an expert elicitation study). A general term referring to all types of limitations in available knowledge that affect the range and probability of possible answers to an assessment question. A collective term for the processes used to identify, characterise, explain and account for sources of uncertainty. Heterogeneity of values over time, space or different members of a population, including stochastic variability and controllable variability. In this guidance, weighing refers to the process of assessing the contribution of evidence to answering a weight of evidence question. The basic considerations to be weighed are identified in this guidance as reliability, relevance and consistency of the evidence. The second of three basic steps of weight of evidence assessment, as proposed in this guidance. Includes deciding what considerations are relevant for weighing the evidence, deciding on the methods to be used, and applying those methods to weigh the evidence. The extent to which evidence supports one or more possible answers to a scientific question. Hence ‘weight of evidence methods’ and ‘weight of evidence approach’ refer to ways of assessing relative support for possible answers. A function of relevance and reliability. A process in which evidence is integrated to determine the relative support for possible answers to a scientific question. The outcome of a weight of evidence assessment, expressed in terms of relative support for possible answers to the weight of evidence question. A question addressed by a weight of evidence assessment. This may be the overall scientific question for an assessment, or a sub‐question that contributes to answering the overall question. Weight of evidence questions may be framed in terms of hypotheses (which are often qualitative) or estimates (quantitative). absorption, distribution, metabolism and excretion adverse outcome pathway Agent, Pathway, Receptor, Intervention and Output benchmark dose benchmark dose lower confidence limit body mass index benchmark response bisphenol A Critical Appraisal Skills Programme coronary heart disease confidence interval cardiovascular disease directed acyclic graphs dietary reference value Expert Knowledge Elicitation Evidence‐based Practice for Improving Quality Grading of Recommendations, Assessment, Development, and Evaluations hexabromocyclododecane health‐based guidance value high‐density lipoprotein health outcome category hazard ratio Integrated Approach to Testing and Assessment International Statistical Classification of Diseases and Related Health Problems Integrated Pest Management Intelligence Quotient incidence rate ratio International Standards for Phytosanitary Measures line of evidence molecular initiating events mode of action margin of exposure no observed adverse effect level odds ratio physiologically based pharmacokinetic perfluoroalkyl substances perfluorooctane sulfonate perfluorooctanoic acid Population, Exposure, Comparator, Outcome Population, Intervention, Comparator, Outcome Population, Index Test, Target Condition Plant Health Population, Outcome Quantitative Pest Risk Assessments randomised controlled trials risk difference risk of bias reference point risk ratio standard deviation Systematic Review Center for Laboratory animal Experimentation Terms of Reference Weight of Evidence uncertainty factor

Question

EFSA‐Q‐2019‐00200

Copyright

EFSA may include images or other content for which it does not hold copyright. In such cases, EFSA indicates the copyright holder and users should seek permission to reproduce the content from the original source.

Requestor

EFSA

Introduction

Epidemiology is the study of the distribution and determinants of health‐related states or events in specified populations, and the application of this study to the control of health problems. Therefore, in the broadest sense, epidemiological studies examine determinants of health and disease conditions in defined populations, including humans, animals and plants. Epidemiological studies include both experimental and non‐experimental studies, the latter is often referred to as ‘observational’ studies. Within EFSA's remit there are well‐established procedures and guidelines covering the use of controlled animal experiments for chemical risk assessment and the use of double blind randomised controlled trials. Other sources of evidence include non‐experimental studies for assessing potential harm or benefits of different factors (chemicals, nutrients, biohazards) in humans, animals, including both analytical and descriptive monitoring studies, and nutritional intervention studies that deviate from randomised controlled trial (RCT) designs. For these sources of evidence, guidance on how to use these studies in EFSA's work is either more limited or lacking. This is particularly the case for epidemiological studies in humans, which are often characterised by high variability and uncertainties related to ethical constraints on what interventions can be made and how information can be collected. Therefore, the way human epidemiological studies are conducted and the information they provide do not always fit into existing frameworks for traditional chemical risk assessment or other established procedures within EFSA. In light of the identified needs, it is important that clear guidance be developed on how evidence from epidemiological studies can be appraised, integrated and used in EFSA's scientific assessments. Such guidance would enable all areas in EFSA's remit to better exploit all sources of evidence, while correctly accounting for their potential limitations. The Scientific Committee has recommended in 2013 and in 2016 that a cross‐cutting guidance be developed on the appraisal and use of epidemiological studies. This recommendation was based on the observation that limited use is made of evidence from non‐experimental studies in chemical risk assessment. This project will: A . deliver a Guidance addressing the following terms of reference (ToRs): Set the basis for giving guidance on how to appraise and interpret findings from different types of epidemiological evidence and its application in EFSA scientific assessments. Provide guidance on how to appraise and integrate evidence from epidemiological studies of humans or animals for specific scientific assessment questions of the different EFSA panels. Particular emphasis should be given to areas where guidance is lacking. Provide guidance on how to use evidence from epidemiological studies in EFSA scientific assessments. Set the basis for giving guidance on how to appraise and interpret findings from different types of epidemiological evidence and its application in EFSA scientific assessments. Provide guidance on how to appraise and integrate evidence from epidemiological studies of humans or animals for specific scientific assessment questions of the different EFSA panels. Particular emphasis should be given to areas where guidance is lacking. Provide guidance on how to use evidence from epidemiological studies in EFSA scientific assessments. Particularly, In relation to safety of chemicals for human health: Provide guidance on how to appraise and use epidemiological evidence from experimental and non‐experimental human studies in scientific assessments of chemicals. In relation to efficacy assessment for human health, animal and plant health: Provide guidance on how to appraise and use epidemiological evidence from experimental and non‐experimental human and animal studies in scientific assessment of efficacy for chemical and biological agents. In relation to safety of chemicals for human health: Provide guidance on how to appraise and use epidemiological evidence from experimental and non‐experimental human studies in scientific assessments of chemicals. In relation to efficacy assessment for human health, animal and plant health: Provide guidance on how to appraise and use epidemiological evidence from experimental and non‐experimental human and animal studies in scientific assessment of efficacy for chemical and biological agents. B . Facilitate the implementation of the Guidance in EFSA's scientific assessments by providing: Info sessions. Trainings. Assistance from a cross‐cutting WG (to be agreed at the adoption of the guidance). Info sessions. Trainings. Assistance from a cross‐cutting WG (to be agreed at the adoption of the guidance). The use of epidemiological studies affects risk assessment in a broad range of areas that fall under EFSA's remit, e.g. nutrition, toxicology, animal and plant health as well as biological hazards. It requires a good understanding of the strengths and limitations of different study designs, and the ability to evaluate studies individually, in a structured manner, and to assimilate and interpret evidence from relevant epidemiological studies. This guidance will provide a brief introduction to the different types of epidemiological studies (Section  4.1 ) and explain the key epidemiological concepts that are relevant for evidence appraisal (Section  4.2 ) to address the first ToR . To address the second ToR , the guidance will explain (1) the three main types of bias, i.e. information bias, selection bias and confounding, in specific study designs; (2) how to make judgements about the direction and the magnitude of the bias; and (3) how to deal with bias when it is identified in a study. Further, different approaches to study appraisal will be described. Finally, the guidance will explain the integration of evidence across different types of epidemiological studies (Section  4.3 ) within the same population (e.g. summary across all human observational studies or across all human experimental studies) and highlight the respective value of the available study designs in the context of the research question and the population under study. The use of epidemiological studies, e.g. for establishing reference values (e.g. health‐based guidance values (HBGVs) such as acceptable daily intake or tolerable daily intake) varies among different panels and depends to a large extent on their different types of scientific assessments (nutrition, toxicology, animal and plant health). Therefore, Section  4.4 , addressing the third ToR , will focus on the panel‐specific needs regarding use of epidemiological studies. It will also provide guidance on specific issues regarding evidence integration. In terms of scope, this guidance will cover the appraisal of experimental and observational epidemiological studies, giving particular emphasis on studies with humans as target populations. The particularities of experimental studies in animals (livestock, companion animals) and plants are briefly explained in Sections  4.1.3 and 4.1.4 . The appraisal of evidence from studies of laboratory animals and in‐vitro studies, as well as guidance on plant or animal disease epidemiology are outside the scope of this guidance.

Coi Statement

If you wish to access the declaration of interests of any expert contributing to an EFSA scientific assessment, please contact [email protected] .

Supplementary Material

Report of the Public consultation on the draft guidance of EFSA's Scientific Committee on appraising and integrating evidence from epidemiological studies for use in EFSA's scientific assessments

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: pmc-nxml

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-06-26T06:14:25.090378+00:00
unpaywall
last seen: 2026-05-21T05:10:58.409756+00:00
License: CC-BY-ND-4.0