Intro
The global prevalence of childhood overweight/obesity increased from 8% in 1990 to 20% in 2022 [ 1 ] and the rate of increase has been higher than that of adult obesity [ 2 ]. Overweight/obesity in childhood increases the risk of adult obesity, which increases the risk of developing long-term conditions such as cardiovascular disease and diabetes [ 3 ]. Data from five UK birth cohorts between 1946 and 2001 showed a trend towards earlier onset of obesity in more recent cohorts, with two to three times greater estimated probability of overweight/obesity in cohorts born after the 1980s compared to before [ 4 ]. In the 2022/2023 academic year, the prevalence of childhood overweight/obesity at the start of primary school was 24.8% in Wales [ 5 ] and 21.4% in England [ 6 ]. Children living in more deprived areas are twice as likely to have obesity as those living in less deprived areas [ 5 , 6 ]. Low socioeconomic position was associated with higher weight in the 2001 UK Millennium birth cohort, with inequalities widening from childhood to adolescence [ 7 ].
Analysis of data from 6066 children in a UK birth cohort found that 75% of children with obesity at 7 years still had obesity at 11 years; of those with overweight at 7 years, 16% developed obesity and 20% returned to a healthy weight by 11 years [ 8 ]. A longitudinal study of 5863 pre-adolescent children found that those of a healthy weight remained a healthy weight during adolescence, whereas few children with overweight/obesity returned to a healthy weight, and there was little evidence of new cases of overweight/obesity emerging during adolescence [ 9 ]. A meta-regression of 48 studies found a high degree of tracking of body mass index (BMI) over time regardless of the age at BMI measurement, with tracking strongest in the pubertal period (10–14 years) and adulthood [ 10 ]. Adults with obesity were at higher risk of developing two or more long-term conditions (five times higher for two and 12 times higher for four or more conditions) and developed conditions earlier than adults of healthy weight [ 11 ]. Globally in 2015, high BMI contributed to 4.9% of disability-adjusted life years from any cause, with over a third of high BMI-related disability-adjusted life years occurring in people with BMI <30 kg/m² [ 2 ]. However, a decrease in adiposity into adulthood has been shown to be associated with reduced risks, similar to those of people with a consistently normal BMI through childhood [ 3 , 12 ].
There is currently no early identification system during pregnancy or early life to detect those at high risk of childhood obesity. As part of the Studying Lifecourse Obesity PrEdictors (SLOPE) study, we utilized anonymized routinely collected antenatal and birth records linked to child health records for births in Hampshire, England, between 2003 and 2013 to develop prediction models for the risk of childhood overweight/obesity. Models were developed at stages corresponding to statutory healthcare visits for women and children in England (first antenatal booking appointment, birth, child aged around 1 year and around 2 years) [ 13 ]. Models were internally and externally validated [ 13 , 14 ]. Maternal factors from pregnancy included in the model remained consistent across the stages, but model discrimination improved across the stages from 0.66 at the booking appointment to 0.83 at child age around 2 years. Stakeholders at a workshop conducted as part of this project suggested that family-focused interventions were important to prioritize in early childhood as this may provide an opportunity to change habits and trajectories before these become more established [ 15 ]. The transition to primary school is an important time, with positive experiences during this time linked to improved social, emotional, and educational outcomes [ 16 ], which in turn were linked to subsequent mental health and weight in adolescence [ 17 ].
In this study, we aimed to explore how well childhood obesity at school entry age can be predicted from healthcare and wider demographic, socioeconomic and area-level factors, using linked health and administrative data in another UK country: Wales. We also aimed to map the predictor variables across pre-conceptualized early-life domains [ 18 ].
Methods
This analysis is part of the Multidisciplinary Ecosystem to study Lifecourse Determinants and Prevention of Early-onset Burdensome Multimorbidity (MELD-B) project [ 19 ] examining the role of early-life factors as childhood obesity can act as a mediator on the pathway to multiple long-term conditions [ 20 ]. As part of MELD-B, we previously conceptually identified 12 domains of early life factors which may be important for multimorbidity risk [ 18 ], and we explored how these conceptualized domains are represented in the variables considered for prediction.
The Secure Anonymized Information Linkage (SAIL) Databank ( www.saildatabank ) holds anonymized, routinely collected individual-level data for all Welsh residents using National Health Service (NHS) Wales. Each individual is assigned a unique identifier called an anonymized linking field (ALF) to ensure anonymity and confidentiality while enabling individual level data linkage across different datasources.
We used the SAIL MELD-B children and Young adults e-cohort (SMYC). Details of the datasources used are described elsewhere [ 21 ]. The SMYC cohort was restricted to individuals born between 1 January 2000 and 31 December 2022 with both residency and health data available before 18 years of age. Individuals had to be linked to a consistent maternal record (linked to at most one mother) to be included in SMYC. This led to a sample of 896 155 individuals.
For this analysis, we further restricted to those with an outcome measurement in the National Community Child Health (NCCH) data between 4 and 5 years (measurements from the Child Measurement Programme) and with a mother identifier as we intended to use maternal variables.
We created two subsamples. The first was restricted to singleton births between 15 March 2010 and 28 March 2012 to enable inclusion of administrative data from the 2011 Census [Office for National Statistics (ONS) 2011 Census Wales]. This was a year before and after the 2011 Census date and allowed for analysis of additional factors only available in the Census. Early-life data were not available for this period.
The second subsample was restricted to singleton births between 2014 and 2018 to enable inclusion of child weight measurements from early-life recorded as part of statutory health checks around 6, 15, and 27 months. The sample was restricted to births up to 2018 to allow inclusion of information at 4–5 years.
The Child Measurement Programme (CMP) for Wales measures the height and weight of children in Reception (primary school entry, age 4–5 years). Measurements are carried out by school nursing teams.
BMI was calculated as weight/height² and converted to age- and sex-adjusted BMI z-scores according to UK 1990 growth reference charts [ 22 ]. The outcome of childhood overweight/obesity was specified using the cut-off of the 91st percentile (z-score +1.33). This cut-off is used for national guidance on clinical management of childhood overweight in the UK [ 23 , 24 ].
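The outcome definition above can be sketched in a few lines. This is a minimal illustration and not the authors' Stata code: it assumes the z-score is obtained with the LMS transformation commonly used with growth references, and the L, M and S values in the test are hypothetical placeholders rather than values from the UK 1990 tables. It also shows that a z-score of +1.33 corresponds to roughly the 91st centile of a standard normal distribution.

```python
import math
from statistics import NormalDist

OVERWEIGHT_Z_CUTOFF = 1.33  # cut-off stated in the text (91st centile)


def bmi(weight_kg: float, height_m: float) -> float:
    """BMI as weight divided by height squared (kg/m^2)."""
    return weight_kg / height_m ** 2


def lms_z_score(value: float, L: float, M: float, S: float) -> float:
    """LMS z-score. L, M and S must be looked up in a growth reference
    for the child's age and sex (assumed approach; not supplied here)."""
    if L == 0:
        return math.log(value / M) / S
    return ((value / M) ** L - 1.0) / (L * S)


def has_overweight_or_obesity(z: float) -> bool:
    """Binary outcome: BMI z-score at or above the 91st-centile cut-off."""
    return z >= OVERWEIGHT_Z_CUTOFF


def centile(z: float) -> float:
    """Centile of a standard normal distribution for a given z-score."""
    return 100.0 * NormalDist().cdf(z)
```

In practice the L, M and S parameters would be taken from the reference tables for each child's exact age and sex before applying the cut-off.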
Maternal age (in years) at pregnancy was recorded. Maternal BMI was calculated using weight and height, available at the first antenatal booking appointment in the Maternity Indicators Dataset (MIDS). For mothers with missing data on height and/or weight or no record in MIDS, we linked to adult BMI recorded in other datasources using the harmonization methodology developed at Swansea University [ 25 ]. Maternal BMI records between 1 year before conception to child’s date of birth were extracted. If more than one weight/height record was available during this period, a priority order based on the timing of measurement was used: conception to 12 weeks, 0–3 months preconception, 3–6 months preconception, 6–9 months preconception, 9–12 months preconception, and 12–24 weeks of pregnancy.
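The priority order for repeated maternal BMI records can be illustrated as follows. This is a hedged sketch only: the day boundaries (3 months taken as about 91 days, 12 weeks as 84 days), the record layout and the tie-break within a window (closest to conception) are assumptions made for illustration, since the text specifies only the order of the windows.

```python
from typing import List, Optional, Tuple

# (label, start_day, end_day) relative to conception (day 0); end is exclusive.
# Order follows the priority list in the text; day boundaries are assumed.
PRIORITY_WINDOWS = [
    ("conception to 12 weeks", 0, 84),
    ("0-3 months preconception", -91, 0),
    ("3-6 months preconception", -182, -91),
    ("6-9 months preconception", -273, -182),
    ("9-12 months preconception", -365, -273),
    ("12-24 weeks of pregnancy", 84, 168),
]


def select_bmi_record(
    records: List[Tuple[int, float]]
) -> Optional[Tuple[str, float]]:
    """Pick one BMI from (days_from_conception, bmi) records.

    Returns (window_label, bmi) from the highest-priority window that
    contains a record, or None if no record falls in any window.
    """
    for label, start, end in PRIORITY_WINDOWS:
        in_window = [r for r in records if start <= r[0] < end]
        if in_window:
            # Assumed tie-break: the record closest to conception.
            _, value = min(in_window, key=lambda r: abs(r[0]))
            return label, value
    return None
```

The same pattern would apply to repeated smoking records, with the windows reordered to match the priority list given for smoking.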
Smoking during pregnancy was recorded in maternity and child health data sources as non-smoker, gave up during pregnancy, smoker and smoker in household. Records of mothers with missing data on smoking during pregnancy were linked to GP records between one year before conception and child’s date of birth. Detailed codes were used to record smoking which was condensed to non-smoker, current smoker, ex-smoker, and stopped smoking. The priority order for repeat smoking records was: conception to 12 weeks, 12 weeks of pregnancy-birth, 0–3 months preconception, 3–6 months preconception, 6–9 months preconception, and 9–12 months preconception. Smoking status was condensed to smoker and non-smoker due to differences in recording between the different sources.
Parity was categorized as 0, 1, 2, and 3 or more. The marriage indicator at birth registration was used to derive marital status as married or living with partner; not married and partner at different address; and single parent. Marital status from the Census was used if the marriage indicator at birth registration was missing.
We linked to maternal GP records to identify the following pre-existing diagnoses: anaemia, asthma, coronary heart disease, type 1 and type 2 diabetes, endometriosis, epilepsy, hyperthyroidism, hypothyroidism, irritable bowel disease, learning disability, polycystic ovarian syndrome, venous thromboembolism, and mental health conditions. These conditions were identified in a review of guidelines, recommendations and policy reports as important indicators of preconception health [ 26 ]. Depression, anxiety, stress, and anaemia were restricted to diagnoses within 5 years preconception, as these were deemed more relevant to the pregnancy and may be experienced for a shorter period. Maternal mental health conditions were categorized as anxiety, depression or stress, and severe mental health conditions (bipolar disorder, schizophrenia, psychotic depression, puerperal psychosis, psychosis).
Maternal educational attainment was categorized as no qualifications, O levels or equivalent, A levels or equivalent, degree/higher or equivalent and foreign qualifications. Parents' country of birth, number of cars in household, unpaid carer in household, parent disability affecting activities, parent health, number of people per bedroom and main household language were self-reported.
Birthweight, gestational age at birth and mode of birth were recorded at birth. Breastfeeding was recorded at birth, 7 days, 10 days, 6–8 weeks, and 6 months, along with the age in weeks when breastfeeding stopped. This was condensed to breastfeeding at birth (the time point with the most data recorded). Ethnic group was recorded as White, Asian, Mixed, Black, and Other. Welsh Index of Multiple Deprivation (WIMD) for the child was linked using the closest record to birth up to a maximum of 6 months after birth. Urban/rural area of mother’s residence was taken from the birth registration record.
Breastfeeding recorded at 6–8 weeks and 6 months and the age in weeks when breastfeeding stopped were used to derive breastfeeding at 6–8 weeks. Child weight was recorded at 6, 15, and 27 months as part of statutory health visits. Weights recorded within 2 months before or after the statutory visit age were assigned to that visit, e.g. a weight recorded between 4 and 8 months was classified as the 6 month visit weight if only one measurement was recorded during this period.
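The assignment of weights to statutory visits can be sketched as below. This is an illustrative reading of the rule rather than the authors' implementation: the window is treated as inclusive at both ends, and a visit with more than one measurement in its window is left unassigned, which is an assumption because the text only covers the case where a single measurement was recorded.

```python
from typing import Dict, List, Optional, Tuple

VISIT_AGES_MONTHS = (6, 15, 27)  # statutory visit ages in Wales (from the text)
WINDOW_MONTHS = 2                # +/- 2 months around each visit age


def assign_visit_weights(
    measurements: List[Tuple[float, float]]
) -> Dict[int, Optional[float]]:
    """Map (age_months, weight_kg) measurements to visit ages.

    A visit gets a weight only when exactly one measurement falls in
    its window; otherwise it is None (assumed handling of zero or
    multiple measurements).
    """
    result: Dict[int, Optional[float]] = {}
    for visit in VISIT_AGES_MONTHS:
        in_window = [
            weight for age, weight in measurements
            if abs(age - visit) <= WINDOW_MONTHS
        ]
        result[visit] = in_window[0] if len(in_window) == 1 else None
    return result
```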
All analyses were performed using Stata [ 27 ]. Because some women had more than one pregnancy in the dataset, we adjusted for clustering by mother using cluster-robust standard errors. The variables with more than 5% missingness in the census subsample were maternal BMI (73.5%), maternal smoking (23.6%), maternal mental and physical health conditions (33.2%), mode of birth (29.7%), and parity (16.4%). In the early-life subsample, these were early-life weight (71.3%–84.9% depending on age at measurement), maternal BMI (37.5%), pre-existing physical health condition (31.8%), mental health condition (23.2%), breastfeeding at 8 weeks (19.0%), mode of birth (16.5%), maternal smoking (12.9%), and parity (9.1%). Multiple imputation by chained equations (MICE) was carried out using truncated regression for continuous variables and predictive mean matching for categorical variables. Missing predictor values were imputed in the sample with the outcome of interest. We generated 75 imputed datasets, with the number of imputations based on the percentage of missing data in the sample [ 28 ].
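Estimates from multiply imputed datasets are conventionally combined with Rubin's rules. The text does not spell out the pooling step, so the sketch below is an assumption about how coefficients from the 75 imputed datasets would typically be combined, not a description of the authors' code. The function name and inputs are hypothetical.

```python
from statistics import mean, variance
from typing import List, Tuple


def pool_rubin(estimates: List[float], variances: List[float]) -> Tuple[float, float]:
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: its squared standard error in each imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total
```

The between-imputation term grows with the fraction of missing information, which is why heavily imputed variables carry wider pooled standard errors.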
Stepwise backward elimination was used to select the variables to be included in the model [ 29 ]. To reduce the risk of overfitting, variables were removed sequentially from the model if their P value exceeded the specified significance level of .157 (equivalent to selection using the Akaike information criterion) [ 30 ]. Models were developed using logistic regression with overweight/obesity fitted as a binary outcome. Non-linear relationships between continuous candidate predictors and the outcome were investigated using fractional polynomials. The number of events per variable was used to check that the sample size was sufficient (based on a rule of thumb of 20 events per variable) [ 31 , 32 ].
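The link between the .157 threshold and the Akaike information criterion can be checked numerically. Dropping a single parameter lowers the AIC when its likelihood ratio statistic is below 2, and the upper-tail probability of a chi-squared variable with 1 degree of freedom at 2 is about .157. The sketch below also applies the 20 events per variable rule of thumb; the event count of about 6673 is an approximation derived from the 12.4% of 53 815 children reported in the Results, and the function names are hypothetical.

```python
import math

AIC_PENALTY_PER_PARAMETER = 2.0


def chi2_1df_p_value(statistic: float) -> float:
    """Upper-tail P value of a chi-squared statistic with 1 degree of freedom.

    For 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


def max_candidate_parameters(n_events: int, events_per_variable: int = 20) -> int:
    """Largest number of candidate parameters allowed by an EPV rule of thumb."""
    return n_events // events_per_variable


# P value at which removing one parameter leaves the AIC unchanged.
p_threshold = chi2_1df_p_value(AIC_PENALTY_PER_PARAMETER)
```

With roughly 6673 events in the census subsample, the rule of thumb would allow several hundred candidate parameters, far more than the number of predictors considered here.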
Models were developed in stages. For the census subsample, the first model included factors from healthcare records only and the second included factors from both the census and healthcare records; both used data available at birth. For the early-life subsample, the first model incorporated data collected at birth, and subsequent models added data from when the child was aged 6 months, 15 months, and 27 months.
Model performance was assessed using discrimination (how well the model differentiates between individuals with and without the outcome) and calibration (agreement, on average, between the predicted and observed outcome). The area under the receiver operating characteristic curve (AUC) was used to summarize the overall discriminatory ability of the models. The AUC was classified as: 0.6–0.7 poor, 0.7–0.8 fair, 0.8–0.9 good, and 0.9–1.0 excellent. The calibration slope was calculated for all models.
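As an illustration of the discrimination measure, the AUC can be computed as the proportion of (case, non-case) pairs in which the case has the higher predicted risk, with ties counted as half. The sketch below is not the authors' code. The band labels follow the text, but treating each band as including its lower bound, and returning "unclassified" below 0.6, are assumptions because the text leaves the boundaries open.

```python
from typing import List


def auc(scores: List[float], outcomes: List[int]) -> float:
    """AUC as the probability that a case outranks a non-case."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    non_cases = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # ties count as half
    return wins / (len(cases) * len(non_cases))


def classify_auc(value: float) -> str:
    """Label an AUC using the bands given in the text (lower bound inclusive)."""
    if value >= 0.9:
        return "excellent"
    if value >= 0.8:
        return "good"
    if value >= 0.7:
        return "fair"
    if value >= 0.6:
        return "poor"
    return "unclassified"
```

The pairwise loop is quadratic in sample size, so a rank-based formula would be preferable for samples as large as those analysed here.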
Heuristic shrinkage factors were calculated for each model [ 33 ] and the regression coefficients from the models were multiplied by the shrinkage factor to adjust for optimism.
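A commonly used heuristic shrinkage factor is (model chi-squared minus degrees of freedom) divided by model chi-squared. The exact formula from the cited reference is not reproduced in the text, so the sketch below assumes this common form. The function names and the example values in the test are hypothetical.

```python
from typing import Dict


def heuristic_shrinkage(model_chi2: float, df: int) -> float:
    """Heuristic shrinkage factor (assumed form): (chi2 - df) / chi2.

    model_chi2: likelihood ratio chi-squared of the fitted model
    df: number of estimated predictor parameters
    """
    if model_chi2 <= 0:
        raise ValueError("model chi-squared must be positive")
    return (model_chi2 - df) / model_chi2


def shrink_coefficients(coefficients: Dict[str, float], factor: float) -> Dict[str, float]:
    """Multiply each regression coefficient (not the intercept) by the factor."""
    return {name: beta * factor for name, beta in coefficients.items()}
```

A factor close to 1 indicates little optimism. After shrinking the coefficients, the intercept is usually re-estimated so that the average predicted risk still matches the observed prevalence.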
Ethics approval for the MELD-B project was granted by the University of Southampton Faculty of Medicine Ethics committee (ERGO 66810). Approval for the use of anonymized data in this study, provisioned within the SAIL Databank, was granted by an independent Information Governance Review Panel (IGRP) under project 1377.
Results
Supplementary Figure S1 shows the eligible sample. Of 53 815 children in the census subsample, 12.4% had BMI ≥91st centile at 4–5 years, as did 13.6% of 60 990 children in the early-life subsample. Baseline characteristics for the samples are summarized in Table 1 and were similar for both subsamples. Mean maternal age at pregnancy was 28.5 years (SD 5.9) and mean maternal BMI was 26.3 kg/m² (SD 6.2) in the census subsample. Over a fifth (23% in the census subsample, 20.2% in the early-life subsample) of mothers were categorized as smokers. Over half (57.4% in the census subsample, 55.4% in the early-life subsample) of mothers reported breastfeeding at birth.
Table 1. Summary of factors and outcome for the sample using the multiple imputed data
Factors retained in the healthcare factors model using the census subsample included the maternal variables age, BMI, smoking, parity, ethnic group, marital status, anaemia, and venous thromboembolism, and the child variables birthweight, gestational age at birth, gender and breastfeeding at birth ( Table 2 ). Additional variables retained in the healthcare factors and census model included unpaid carer, maternal educational attainment, mother’s type of area of residence (urban/rural) and WIMD of child’s residence.
Table 2. Intercept and regression coefficients of the prediction models for overweight and obesity (91st centile) in children aged 4–5 years in the census subsample using multiple imputed data ( n = 53 815)
Only predictors significant at P < .157 were included in the models. Categorical variables with at least one significant category have been included.
Factors retained in the models using the early-life subsample were similar ( Table 3 ). The same variables were retained in both the census subsample model using healthcare factors only and the early-life subsample model at birth, with the exception of the mother’s pre-existing conditions. The census subsample model included anaemia and venous thromboembolism, whereas the early-life subsample models included coronary heart disease and hypothyroidism (at birth only) and polycystic ovarian syndrome (at birth and early-life). Child weight during early-life was retained in the early-life models.
Table 3. Intercept and regression coefficients of the prediction models for overweight and obesity (91st centile) in children aged 4–5 years in the early-life subsample using multiple imputed data ( n = 60 990)
Only predictors significant at P < .157 were included in the prediction models. Categorical variables with at least one significant category have been included.
Discrimination (AUC) in the census subsample was 0.66 in the healthcare factors model, and 0.67 in the combined model with healthcare and census factors. Discrimination (AUC) in the early-life subsample was 0.67 in the birth model, increasing to 0.79 in the model at ∼27 months. Calibration slopes of all the models were close to 1 (0.92–1.00), indicating good calibration with slight overprediction in some models.
Variables entered into the model mapped to the following domains: prenatal, antenatal, neonatal, and birth; demographic; transgenerational impact of parent health and health behaviours; socioeconomic; and neighbourhood, physical environment and health care systems ( Table 4 ). Variables from all domains were retained in the model. All variables entered from the prenatal, antenatal, neonatal and birth domain and from the neighbourhood, physical environment, and health care systems domain were retained.
Table 4. Mapping of predictor variables to pre-conceptualized domains of early-life risk [ 18 ].
Conditional on other variables in the model, maternal BMI was the strongest predictor in the model. Birthweight was also a strong predictor in the models at birth but less so when weight during early-life was included.
Discussion
We developed models for predicting risk of childhood overweight/obesity at 4–5 years using linked healthcare and administrative data sources in Wales. Predictors retained from healthcare records were largely consistent with previously developed models in England [ 13 ]. However, additional census and area-level variables were retained though increase in model discrimination was marginal (0.66–0.67). In the early-life subsample, model discrimination increased from 0.67 at birth to 0.79 when incorporating child factors from early-life. Model discrimination at birth is poor but the maternal and birth factors are consistent across the stages which allows for early identification with more precise estimation as the child grows. Maternal BMI and birthweight are strong predictors and consistent across model stages. Factors from all domains that variables were mapped to were retained in the model.
The prediction model previously developed as part of the SLOPE study using healthcare data in Hampshire, England had a higher AUC using factors at birth (0.69) compared to these models developed using SAIL data (0.66 in census subsample using healthcare records only, 0.67 in early-life subsample) [ 13 ]. There were differences in variables available across the two datasets and how some variables were categorized. For example, breastfeeding at birth could not be included in model development in SLOPE due to high percentage of missingness but was retained in the SAIL models. Maternal smoking was available in both datasets but was categorized as never smoked, ex- and current smoker in SLOPE and non-smoker and smoker in SAIL so the non-smoker category could have included both never- and ex-smokers. Partnership status in SLOPE was a binary (yes/no) variable but had more detail in SAIL regarding whether mothers with a partner were living with the partner or not. Differences in included factors, definitions of factors and population characteristics can affect predictive performance [ 34 ] which may explain the lower AUC in the SAIL models.
Common predictors across the birth model in SLOPE and both SAIL subsamples were maternal age, maternal BMI, maternal smoking, ethnicity, parity, partnership status, birthweight, gestational age at birth and gender. Maternal educational attainment, intake of folic acid supplements and English as first language were available in SLOPE and retained as predictors in the model. Maternal educational attainment and English/Welsh as household first language were available in the Census data, but only maternal educational attainment was retained in the SAIL models.
Other variables retained in the SAIL model using the census subsample were whether there were unpaid carers in the household and whether the carer was working, WIMD, the mother’s area type of residence at birth and maternal pre-existing health conditions (anaemia, venous thromboembolism). This demonstrates that factors available in healthcare records, such as maternal age, BMI, ethnicity, smoking status, parity and birthweight, which are established childhood obesity risk factors, are strong predictors.
In the SLOPE models, we found that discrimination improved when we added child weight at around 1 year (0.78) and 2 years (0.83), which are the points with statutory healthcare checks in England. These occur at slightly different ages in Wales (6, 15, and 27 months), and the AUC for models at these stages using the early-life subsample in SAIL followed a similar pattern (0.72 at 6, 0.78 at 15, and 0.79 at 27 months). Consistency of factors across the model stages means that high-risk groups could be identified early, with more precise risk estimation as the child grows.
The marginal increase in model discrimination on adding census factors implies that healthcare factors are more important for predicting risk of childhood overweight/obesity. This is in line with the literature with existing prediction models for overweight/obesity commonly including factors such as maternal BMI and birthweight, and health related behaviours such as maternal smoking during pregnancy [ 13 , 35 , 36 ]. Some models also include demographic data such as partnership status, maternal educational attainment or employment status, either from healthcare records [ 13 ] or using cohort data [ 28 ]. Household and area-level factors are less considered despite being wider determinants that contribute to risk which may be due to lack of recording in healthcare records or cohorts around early-life. Factors available in healthcare records usually include some measure of socioeconomic status and it is likely that socioeconomic status variables are correlated so adequate prediction may be achieved using the variables available in healthcare records.
Previous research has shown that early-life factors are associated with long-term health outcomes [ 37 ] and several early-life domains were found to be predictors of health outcomes in adulthood [ 38 , 39 ]. Factors from all available domains were retained in the model in this analysis with all factors from the prenatal, antenatal, neonatal, and birth domain being retained. Prevention policy choices based on domains may help address wider determinants of health. For example, socioeconomic circumstances likely play an important role in the amount of resource available to parents towards achieving/maintaining a healthy weight, particularly with rising child poverty and food insecurity in England [ 40 ]. Risk stratification may help direct resources like healthy eating family vouchers or other financial or social support interventions towards those at highest risk.
A strength of this analysis is the population-based sample, which enhances generalizability. We used robust statistical methods to develop the models (retaining continuous variables as continuous by investigating variable transformations using multivariable fractional polynomials, and calculating model shrinkage) and to assess model discrimination. Routinely collected healthcare data were linked to administrative records, which allowed us to consider household and area-level factors in predicting the risk of childhood overweight/obesity. Linking to other data sources in SAIL helped reduce missingness in some key factors, but the high percentage of missing data led to some loss of information, such as fewer categories for maternal smoking during pregnancy. We additionally carried out multiple imputation of missing data, which enabled more robust analysis but requires caution in interpretation given the number of imputations required. The use of BMI is a limitation as it does not distinguish between fat mass and fat-free mass; however, age- and sex-adjusted BMI is recommended as a practical estimate of adiposity in children and young people to identify overweight and obesity.
Childhood obesity can act as a mediator on the pathway to multiple long-term conditions. Risk identification tools may be beneficial to target early prevention during antenatal and early-life care. The use of factors available in healthcare records makes the implementation of the risk prediction easier in healthcare settings. Risk could be quantified at birth as most maternal factors remained consistent across models with more precise estimation in early years.