Results
A total of 329 participants enrolled in the study. Among them, 12 participants withdrew from the study with no data contributed, 49 participants did not provide any menstrual cycle data for analysis, and 8 participants’ data were excluded based on the data exclusion criteria. Two hundred and sixty participants, providing 889 cycles with a positive LH test, were included in the analysis of retrospective ovulation day estimates and period start day predictions. The mean age was 32.6 ± 8.0 years of age (range 16–58 years). The mean BMI was 27.5 ± 5.9. 25.8% self-identified as Black, 30.0% White, non-Hispanic, 23.5% Asian, 21.5% Hispanic, 1.5% American Indian, and the remaining as mixed, Pacific Islander, or Middle Eastern. Half of the participants (50%) contributed five or more menstrual cycles ( Table 1 ). One hundred and forty-two had typical cycle lengths in all cycles, and 118 had some atypical cycle lengths ( Table 1 , Supplementary Table S1 ).
Demographics for all participants, those with typical cycle lengths and those with any atypical cycle lengths.
Participants with typical cycle lengths had all cycle lengths between 23 and 35 days. Participants with atypical cycle lengths had at least one short (10–22 days) or long (36–57 days) cycle.
We evaluated the performance of Algorithm 1 in ongoing menstrual cycles. For ongoing cycles where the wrist temperature value changed by ≥0.2°C, representing a temperature signal commonly associated with ovulation, Algorithm 1 provided a retrospective ovulation estimate in 80.5% (MAE 1.59 days, 95% CI 1.45, 1.74) of ongoing cycles that had a positive LH test, and 80.0% of the estimates (PAE2) had errors within ±2 days. The distribution of retrospective ovulation estimate errors for ongoing cycles is shown in Fig. 1 . When all temperature signals were evaluated for ongoing cycles, Algorithm 1 provided a retrospective ovulation estimate in 70.8% of cycles (MAE 1.91 days, 95% CI 1.73, 2.09, PAE2 73.8%) ( Table 2 ). The percentage of cycles with retrospective ovulation estimates in ongoing cycles with a ≥0.2°C temperature shift was similar when those with typical cycle lengths (81.9%, MAE 1.53, 95% CI 1.35, 1.70, PAE2 81.1%) were compared to those with atypical cycle lengths (77.7%, MAE 1.71, 95% CI 1.42, 2.01, PAE2 77.5%) ( Table 2 ). Retrospective ovulation day estimates were generated by the algorithm in 8% of ongoing cycles where only temperature noise was provided as an input ( Table 3 ). The median latency of the algorithm providing a retrospective ovulation estimate was 4 days ( Table 2 ). Based on the permutation test, Algorithm 1 had a significantly lower MAE than the calendar method (MAE of Algorithm 1: 1.91, MAE of calendar method: 2.35, P < 0.001). Using the permutation test, Algorithm 1 also had a significantly higher PAE2 than the calendar method (PAE2 of Algorithm 1: 74%, PAE2 of calendar method: 63%, P < 0.001) ( Supplementary Table S2 ).
Menstrual cycle prediction in all cycles. Histogram of ongoing menstrual cycles ovulation estimate errors (left, Algorithm 1), completed menstrual cycles ovulation estimate errors (middle, Algorithm 2), and menstrual cycle start prediction errors (right, Algorithm 3) from 889 cycles from 216 participants. 73.8% of ongoing menstrual cycles had retrospective ovulation estimates with error within ±2 days, 81.1% of complete menstrual cycles had retrospective ovulation estimates with error within ±2 days, and 81.1% of menstrual cycle start predictions had error within ±3 days.
Ongoing menstrual cycles (Algorithm 1).
The table evaluates the ability of the algorithm to retrospectively estimate ovulation day using wrist temperature as compared to labeling with LH testing. Participants with typical cycle lengths had all cycle lengths between 23 and 35 days. Participants with atypical cycle lengths had at least one short (10–22 days) or long (36–57 days) cycle.
ME, mean error; MAE, mean absolute error; PAE2, percentage of absolute error within 2 days.
Unit of ME, MAE, 95% CI, and latency is day.
Retrospective ovulation day estimate in cycles without a positive LH test result.
Temperature noise reflects the algorithm (Algorithm 1, Algorithm 2) estimate of ovulation from temperature noise without any actual temperature signal.
We evaluated the next algorithm’s performance in completed menstrual cycles (Algorithm 2). For completed cycles, retrospective ovulation day was estimated in 80.8% (MAE 1.22 days, 95% CI 1.11, 1.33) of cycles where the wrist temperature value changed by ≥0.2°C, and 89.0% of the estimates (PAE2) had errors within ±2 days. The distribution of retrospective ovulation estimate errors for completed cycles is shown in Fig. 1 . When all temperature signals were evaluated for completed cycles, Algorithm 2 provided a retrospective ovulation estimate in 70.9% of cycles (MAE 1.66 days, 95% CI 1.49, 1.83, PAE2 81.1%) ( Table 4 ). The percentage of completed cycles with retrospective ovulation estimates with a ≥0.2°C temperature shift was similar when those with typical cycle lengths (81.7%, MAE 1.17, 95% CI 1.05, 1.30, PAE2 89.6%) were compared to those with atypical cycle lengths (79.0%, MAE 1.32, 95% CI 1.13, 1.53, PAE2 87.8%) ( Table 2 ). Retrospective ovulation day estimates were generated by the algorithm in 15% of completed cycles where only temperature noise was provided as an input ( Table 4 ). Based on the permutation test, Algorithm 2 had a significantly lower MAE than the calendar method (MAE of Algorithm 2: 1.66, MAE of calendar method: 2.03, P = 0.003). Using the permutation test, Algorithm 2 also had a significantly higher PAE2 than the calendar method (PAE2 of Algorithm 2: 81%, PAE2 of calendar method: 71%, P = 0.002) ( Supplementary Table S3 ).
Completed menstrual cycles.
The table evaluates the ability of the algorithm (Algorithm 2) to retrospectively estimate ovulation day using wrist temperature as compared to labeling with LH testing. Participants with typical cycle lengths had all cycle lengths between 23 and 35 days. Participants with atypical cycle lengths had at least one short (10–22 days) or long (36–57 days) cycle.
ME, mean error; MAE, mean absolute error; PAE2, percentage of absolute error within 2 days.
Unit of ME, MAE, and 95% CI is day.
Evaluation of the last algorithm’s performance was for the next menses start day prediction (Algorithm 3). Prediction of menstrual start day when the wrist temperature signal was ≥0.2°C demonstrated an ME of −0.02 days (95% CI −0.22, 0.18) with an MAE of 1.65 days (95% CI 1.52, 1.79), with 89.4% of the predictions being within ±3 days of menses start (PAE3) ( Table 4 ). When all temperature signals were evaluated, Algorithm 3 provided an ME of −0.06 days (95% CI −0.28, 0.14) with an MAE of 1.70 days (95% CI 1.57, 1.84), and 88.4% of all predictions being within ±3 days of menses start (PAE3) ( Table 5 ). The distribution of menstrual cycle start prediction errors for those with typical and atypical cycle lengths is shown in Figs. 2 and 3 . For cycles with a ≥0.2°C wrist temperature shift, the MAE for participants with typical cycle lengths was 1.49 days (95% CI 1.36, 1.63, PAE3 91.2%), and was 1.97 days (95% CI 1.13, 2.24, PAE3 85.4%) for atypical cycle lengths ( Table 5 ). Based on the permutation test, Algorithm 3 had significantly lower MAE than calendar method (MAE of Algorithm 3: 1.70, MAE of calendar method: 1.90, P < 0.001). Using the permutation test, Algorithm 3 had a significantly higher PAE3 than calendar method (PAE3 of Algorithm 3: 88%, PAE3 of calendar: 84%, P < 0.001) ( Supplementary Table S4 ).
Menstrual cycle prediction in typical length cycles. Histogram of ongoing menstrual cycles ovulation estimate errors (left, Algorithm 1), completed menstrual cycles ovulation estimate errors (middle, Algorithm 2), and menstrual cycle start prediction errors (right, Algorithm 3) from all 569 cycles of 126 participants with all typical cycle lengths. Participants with typical cycle lengths had all cycle lengths between 23 and 35 days.
Menstrual cycle prediction in atypical cycles. Histogram of ongoing menstrual cycles ovulation estimate errors (left, Algorithm 1), completed menstrual cycles ovulation estimate errors (middle, Algorithm 2), and menstrual cycle start prediction errors (right, Algorithm 3) from all 320 cycles of 90 participants with atypical cycle length. Participants with atypical cycle lengths had at least one short (10–22 days) or long (36–57 days) cycle.
Menstrual cycle start.
The table evaluates the ability of the algorithm (Algorithm 3) to predict future menstrual cycle start day using wrist temperature as compared to logged menstrual flow. Participants with typical cycle lengths had all cycle lengths between 23 and 35 days. Participants with atypical cycle lengths had at least one short (10–22 days) or long (36–57 days) cycle.
ME, mean error; MAE, mean absolute error; PAE3, percentage of absolute error within 3 days.
Unit of ME, MAE, and 95% CI is day.
Algorithm 1 using BBT provided a retrospective ovulation estimate in 64.2% (MAE 1.81 days, 95% CI 1.59, 2.01) of cycles ( Table 2 , Supplementary Table S5 ; Vollman, 1977 ; Moghissi, 1980 ; Royston and Abrams, 1980 ). Algorithm 2 using BBT provided a retrospective ovulation day estimate in 69.3% of cycles (MAE 1.53 days, 95% CI 1.40, 1.66) ( Table 4 , Supplementary Table S6 ). BBT as the input temperature for Algorithm 3 provided a similar MAE (1.72 days, 95% CI 1.59, 1.87) when the BBT temperature signal was ≥0.2°C ( Supplementary Table S7 ).
Materials
We conducted a prospective cohort study of menstruating females, aged 14 and above, who resided in the USA from June 2021 through May 2022 and had an iPhone for use during the study. Participants were excluded if they were using hormones (e.g. oral contraception), had discontinued hormonal contraception in the last 2 months, were pregnant or became pregnant, were lactating, had not had at least eight menstruations in the last year and one menstruation in the last 2 months, had a cancer diagnosis in the past year, had surgery in the past 2 months, or had a scar or tattoo on the dorsal wrist(s). We did not exclude based on reproductive intent; use of non-hormonal contraception (e.g. condoms) was permitted. We did not exclude based on cycle regularity. Recruitment was conducted via Exponent, a contract research organization, using the Fieldwork Network database, which is compiled from previous engagements, and self-registration via physical recruitment opportunities and social media. Fieldwork identified individuals who may be eligible for the study based on inclusion criteria and contacted them by phone. Recruitment targeted distribution across demographic bins related to age (<18, 18–24, 25–34 and 35–45 years old), BMI (17–25, 26–30, 31–40), and race/ethnicity (participant self-identified, including White/Caucasian, Black or African American, Hispanic, Asian, American Indian, Middle Eastern, and Native Hawaiian). Study invitation and enrollment were conducted virtually. The study had a duration of 12 months. Participant time in the study was dependent on the time at which they enrolled. Sample size was evaluated for 90% power using a two-sided Type I error of 0.05 across a range of expected mean differences and standard deviations for fixed equivalence limits of ±2 days assuming one data point per participant. 
When a participant contributed multiple data points, calculating the sample size based on a single observation per participant gives us a conservative sample size, given we have more information from multiple observations per participant. Prior studies done within our research program had suggested a mean difference (between the algorithm and ground-truth reference of LH testing) of approximately 1.1 days with a standard deviation of approximately 4.4 days, yielding a minimum sample size of N = 273 evaluable ovulatory cycles to achieve 90% power, which maps to at least 253 participants.
The study was approved by the Institutional Review Board at Advarra CIRB PRO00054031 and is registered on ClinicalTrials.gov (ClinicalTrials.gov Identifier: NCT05852951 ). Informed consent was obtained virtually, with the ability of participants to opt out should they choose. Minor participants had an in-person visit for study enrollment, assent, and consent. Compensation was provided: $50 for training completion, $250 for the initial compliant month, $62.50 for each additional compliant week, $200 for every 12 weeks of compliance, and retention of provided Apple Watch at study completion if at least 16 weeks have been completed.
Participants were shipped study supplies, including LH urine test strips (Pregmate Ovulation Test Strips, Pregmate, Fort Lauderdale, Florida, USA), an oral thermometer (Easy@Home Smart Basal Thermometer, Premom, Easy@Home Fertility, Burr Ridge, Illinois, USA) for collecting BBT, an Apple Watch, and an additional Apple Watch prototype that measured overnight wrist temperature. A video visit was used to set up the devices and review the study protocol. Data collection was via an app that allowed participants to provide demographic information such as weight, height, race/ethnicity, complete surveys, and manually log results of testing (e.g. urine LH test results) via their iPhone ( “Apple Research app,” n.d. ). Participants logged menstrual bleeding and symptoms in the Health app, a central and secure place for health information available on iPhone ( Apple Cycle Tracking, n.d. ; Mahalingaiah et al. , 2021 ). Minimum Apple Watch wear was at least 6 h while awake and overnight, at least 4 h while sleeping. Wrist temperature provided to the algorithms represents temperature sampled every 5 s from the participant’s wrist during sleep, corrected for environmental bias, then aggregated into a single sleeping wrist temperature; this temperature value is available in the Health app for consumer and research use ( “Apple Developer Documentation, Sleeping Wrist Temperature,” n.d. ). Daily, first-morning urine LH testing was started on cycle day 7 and continued until either a surge was detected or the next menses started. Oral BBT was collected each morning, immediately upon awakening ( Supplementary Fig. S1 ).
Cycle day 1 was defined as the first day of logged menstrual flow. One day after a logged positive LH test was defined as the day of ovulation and used as the ovulation day label for algorithm testing ( Su et al. , 2017 ). Leveraging large, population-based studies representing several million menstrual cycles, and cutoffs that would put cycles outside of the standard deviation of the mean for the population in the atypical category, we chose to define a typical cycle length as a cycle length between 23 and 35 days, while atypical cycle lengths were outside of this range ( Fraser et al. , 2011 ; Munro et al. , 2018 ; Bull et al. , 2019 ; Li et al. , 2020 ). Ongoing menstrual cycles were defined as cycles where the next menses have not yet been logged, and therefore the full cycle length/number of cycle days is not yet known and available to the algorithm. Completed menstrual cycles are defined as cycles for which the total number of cycle days is known and available to the algorithm because the subsequent menses have been logged. Retrospective ovulation day estimate is the day of ovulation as estimated by the algorithms using temperature. Algorithm 1 is defined as the algorithm which uses the first day of the last period and wrist temperature as inputs and provides a retrospective ovulation estimate before the cycle is completed. Algorithm 2 is defined as the algorithm which uses the first day of the last period, wrist temperature, and cycle length as inputs and provides a retrospective ovulation estimate after the cycle is completed. Algorithm 3 is defined as the algorithm which uses the first day of the last period and wrist temperature as inputs and provides a next menses start day prediction. Specific algorithms, including their modeling, development, inputs, and associated information, are considered Apple proprietary information and are not included here.
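The labeling definitions above can be illustrated with a minimal Python sketch. It covers only the stated conventions (cycle day 1 is the first day of logged flow, the ovulation label is one day after a logged positive LH test, and a typical cycle length is 23 to 35 days). The function and variable names are hypothetical, and nothing here reflects the proprietary algorithms.

```python
from datetime import date, timedelta

# Typical cycle length range as defined in the text (inclusive bounds).
TYPICAL_MIN_DAYS = 23
TYPICAL_MAX_DAYS = 35


def ovulation_label(positive_lh_day: date) -> date:
    """Ovulation day label: one day after a logged positive LH test."""
    return positive_lh_day + timedelta(days=1)


def cycle_day(cycle_start: date, day: date) -> int:
    """Cycle day number, where the first day of logged flow is cycle day 1."""
    return (day - cycle_start).days + 1


def has_only_typical_cycles(cycle_lengths) -> bool:
    """True if every cycle length for a participant falls within 23 to 35 days."""
    return all(TYPICAL_MIN_DAYS <= n <= TYPICAL_MAX_DAYS for n in cycle_lengths)
```

Under these assumptions, a participant with cycle lengths of 28, 30, and 36 days would fall in the group with some atypical cycle lengths.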
For all analyses, cycles were excluded if the gap between cycle day 1 and ovulation (follicular phase) was 34 days, or the gap between ovulation and cycle end (luteal phase) was 23 days, to exclude cycles that may have been attributed to errors in participant logging, leading to exclusion of 61 cycles (4.7%) ( Sherman and Korenman, 1975 ; Baird et al. , 1995 ; Cole et al. , 2009 ; Najmabadi et al. , 2020 ). Among the 61 excluded cycles, 2 (3.3%) have luteal phase 23 days, 16 (26.2%) have follicular phase 34 days.
Discussion
The algorithm using wrist temperature can provide retrospective ovulation day estimates in 80.5% of ongoing menstrual cycles with a ≥0.2°C temperature shift, and 80.0% of these estimates are within ±2 days of the ovulation day (Algorithm 1). Algorithm 2 provided retrospective ovulation day estimates in 80.8% of completed menstrual cycles with a ≥0.2°C temperature shift, with 89.0% of estimates being within ±2 days of ovulation. Algorithm 3 provided next menses start day predictions within ±3 days of menses start in 89.4% of cycles with a ≥0.2°C temperature shift. Similar results were seen when evaluating participants with all typical or some atypical cycle lengths.
With increasing availability of wearable devices that can evaluate temperature, it is important to understand expected differences if one selects a wearable versus more traditional daily BBT measurements for reproductive health monitoring ( Izmailova et al. , 2018 ). Wearable temperature sensors may not replicate the exact BBT values, but rather provide an alternative temperature measurement that can be used to evaluate menstrual physiology ( Uchida and Izumizaki, 2022 ). In a study of 57 participants, comparing oral BBT measurements to wrist skin temperature, overnight continuous wrist skin temperature was more sensitive than BBT and had a higher true-positive rate (54.9% vs. 20.2%) for detection of ovulation, while also demonstrating a higher false-positive ovulation detection (8.8% vs. 3.6%) ( Zhu et al. , 2021 ). A pilot study of 22 participants using a sensor ring for nocturnal finger skin temperature measurements found correlation between skin and oral BBT, finding the temperature differences between follicular and luteal phases were higher with the skin temperature than with oral temperature ( Maijala et al. , 2019 ). In contrast, a study of 16 individuals who collected upper arm skin temperature found that under both quantitative and visual analysis, there was no agreement between upper arm skin temperature and BBT for evidence of ovulation ( Wark et al. , 2015 ). Here we found that in both ongoing and completed menstrual cycles, Algorithms 1 and 2, using wrist temperature, provided a retrospective ovulation day estimate in a larger percentage of cycles than using BBT in the algorithms ( Zhu et al. , 2021 ).
A variety of sensor types exist to collect temperature readings that can be used to estimate the retrospective day of ovulation. In a study of 74 participants with diagnosed infertility, participants wore an axillary thermometer patch for 7 days during a predicted fertile window and confirmed ovulation in 81.8% of the cases, confirming 21.7% with exact accordance, 35.1% within 1 day of ovulation, and 12.2% within 2 days of ovulation ( Weiss et al. , 2022 ). Using wrist skin temperature from a sensor bracelet, in a study of 126 participants with an average cycle length of 28.8 days, a biphasic skin temperature pattern was identified in 82% of cycles with confirmed ovulation detected using LH testing ( Shilaih et al. , 2018 ). Evaluating a variety of algorithms, the use of finger temperature provided sensitivities up to 83.3% for ovulation day detection from −3 to +2 days around the verified ovulation, though they report excluding the two participants with BMI above 30, stating BMI may be a potential confounder affecting distal skin temperature and risk for menstrual disorders ( Maijala et al. , 2019 ). Our method of retrospective ovulation day identification appears to have similar accuracy to others based on wrist skin temperature and finger temperature in previous reports, as characterized as the proportion of cycles with an estimate within ±2 days of the ovulation day labeled by positive LH test.
Performance of menstrual cycle start day predictions across a range of cycle lengths can be challenging, with many existing studies or products limiting the cycle length that is supported. A study collecting temperature from an in-ear device and heart rate reported menstrual start day predictions with an accuracy of 89.6% among regular menstruators (cycle lengths 25–35 days) and 72.5% in irregular menstruators (cycle length 35 days). In a pilot study of finger temperature, 22 participants with cycle lengths ranging from 21 to 50 days wore a sensor ring for temperature collection, giving a sensitivity of 81.4% to identify the menstrual start day within ±3 days ( Maijala et al. , 2019 ). Our work demonstrates that for participants with all cycle lengths between 23 and 35 days, as compared to those with some atypical cycle lengths, there was a similar ability of the algorithm to provide a menstrual start prediction within ±3 days of the actual menses start, with the difference in MAE being about 1 day more for atypical cycles.
Inclusion of data from wearable devices is increasingly being considered in the clinical care setting ( Wang et al. , 2018 ; Weng et al. , 2024 ). The use of wearable temperature data to estimate ovulation may have clinical applications when patients present seeking advice on timing of intercourse, as review of logged sex alongside ovulation estimates can inform these conversations ( Stanford et al. , 2002 ; Gibbons et al. , 2023 ). In other cases, calculating a personal luteal phase length, rather than relying on population-level data, may assist in more accurate prediction of upcoming fertile windows for individuals seeking to conceive ( Vitzthum et al. , 2021 ). Moreover, retrospective knowledge of ovulation timing informs the prediction of the next menses onset, which for many people is the primary reason they track their cycles. Because the luteal phase tends to be more consistent in length than the follicular phase, it can predict the subsequent menses onset more precisely than cycle length alone. While no method should be interpreted in isolation, increasingly patients are turning to data from digital applications for fertility and family planning ( Su et al. , 2017 ).
This study has several strengths. First, many studies of menstruation or reproductive health focus on individuals who are trying to conceive, while here we included participants regardless of reproductive intent. Our participants included early menstruators and individuals in menopausal transition, a wide range of represented BMIs, and included significant representation of Hispanic, Latina, Spanish, and/or other Hispanic, as well as Asian, and Black or African American or African participants, thus extending the generalizability of the results. By not limiting our evaluation to those with the most regular or typical menstrual cycle lengths, we have also demonstrated that wrist temperature can be utilized for those with a wider range of cycle lengths. Most traditional uses of BBT for menstrual health have focused on retrospective identification of ovulation, which may not be relevant for those disinterested in family planning. The use of wrist temperature for the next menses start day prediction adds value for those who may not need to track their ovulation but can utilize menses start predictions. Additionally, the perceived burden of taking daily BBT may be alleviated with the use of wrist temperature ( Martinez et al. , 1992 ; Shilaih et al. , 2018 ).
This study has several limitations. While reliance on urine LH testing to identify ovulation is highly practical and matches how many individuals track their personal fertility metrics, it may mislabel some cycles. This may occur because an LH surge was never detected but ovulation did indeed occur, because the LH surge was not followed by ovulation, as can be the case in those with polycystic ovary syndrome, or due to user error ( Coyle and Campbell, 2019 ). Our evaluation of false retrospective ovulation day estimates reinforces the idea that this estimate does not provide clinical confirmation of ovulation and should not be used in isolation. Reported fevers were rare (4% of cycles), which prevents us from evaluating the impact of incidental fevers or medications on algorithm performance. Additionally, the ability to label cycles as anovulatory based on temperature changes is outside the scope of the report, as lack of temperature signal at the wrist may be impacted by multiple factors. An additional challenge is that the use of temperature (wrist or BBT) as a signal for ovulation inherently means that the identification of ovulation day is retrospective. While retrospective ovulation day estimation has significance for those tracking their physiology, temperature alone can present an incomplete picture of the menstrual cycle; additional information from urine LH testing, urine progesterone testing, or monitoring cervical mucus quality may help an individual or clinician better define the fertile window and cycle events ( Baird et al. , 1995 ; Hassoun, 2018 ). Evaluation of menstrual cycles and temperature patterns in groups with specific conditions, such as polycystic ovary syndrome, endometriosis, or thyroid conditions, may expand our understanding of menstrual cycle pathology. Additionally, the use of wearables may allow collection of behavioral or physiologic metrics, such as activity and sleep, that may impact cycle patterns and ovulation.
Conclusions
Algorithms using wrist temperature can provide retrospective ovulation estimates and next menses start day predictions for individuals with typical or atypical cycle lengths. Inclusion of wrist skin temperature, along with other physiologic measurements, can aid in menstrual cycle tracking and understanding cycle patterns.
Statistical
The accuracy of retrospective ovulation day estimate performance from Algorithms 1 and 2 was evaluated with the following statistical metrics: mean error (ME), mean absolute error (MAE), and percentage of absolute error within 2 days (PAE2). The ground-truth ovulation date of a cycle was determined based on LH testing. Specifically, the ovulation estimate error for a menstrual cycle was derived as the algorithm’s ovulation estimate minus ground-truth ovulation date, where a negative error means the estimated ovulation date is earlier than the ovulation date based on LH label. ME was calculated as the mean of ovulation estimate error over a collection of cycles, i.e. $\frac{1}{n}\sum_{i=1}^{n}\mathrm{Error}_i$, and MAE was calculated as the mean of absolute ovulation estimate error over a collection of cycles, i.e. $\frac{1}{n}\sum_{i=1}^{n}|\mathrm{Error}_i|$. We further estimated 95% CI of ME and MAE using an empirical bootstrap approach. In detail, bootstrap resampling was conducted at the participant level. For each of the 1000 bootstrap resamples, ME was calculated, and the 95% CI was the 2.5th and 97.5th percentile of the ME. This non-parametric approach does not rely on any distributional assumptions for the statistics of interest and handles potential within-participant correlation among cycles. PAE2 is defined as the percentage of absolute error within 2 days, i.e. $\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}(|\mathrm{Error}_i|\le 2)$. All metrics were evaluated for ongoing and completed menstrual cycles separately.
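As an illustration of the metrics and the participant-level bootstrap described above, the following Python sketch may be useful. It is not the study's analysis code: the function names, the input layout (a dictionary mapping a participant identifier to that participant's per-cycle errors in days), and the simple nearest-rank percentile rule are all assumptions made for this example.

```python
import random
from statistics import mean


def error_metrics(errors, within=2):
    """Return (ME, MAE, PAE) for a list of per-cycle errors in days.

    PAE is the percentage of cycles with |error| <= `within`
    (within=2 for PAE2, within=3 for PAE3).
    """
    me = mean(errors)
    mae = mean([abs(e) for e in errors])
    pae = 100.0 * sum(abs(e) <= within for e in errors) / len(errors)
    return me, mae, pae


def bootstrap_ci(errors_by_participant, stat=mean, n_resamples=1000, seed=0):
    """Percentile 95% CI of `stat`, resampling participants with replacement.

    Resampling whole participants keeps each participant's cycles together,
    which is how within-participant correlation is respected.
    """
    rng = random.Random(seed)
    ids = list(errors_by_participant)
    stats = []
    for _ in range(n_resamples):
        sampled = rng.choices(ids, k=len(ids))
        pooled = [e for pid in sampled for e in errors_by_participant[pid]]
        stats.append(stat(pooled))
    stats.sort()
    # Nearest-rank percentiles (an assumed, simplified percentile rule).
    lower = stats[int(0.025 * (n_resamples - 1))]
    upper = stats[int(0.975 * (n_resamples - 1))]
    return lower, upper
```

To obtain a CI for MAE rather than ME under this sketch, one could pass `stat=lambda xs: mean([abs(x) for x in xs])`.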
A similar set of error metrics was used to evaluate the accuracy of the next menstrual start day prediction from Algorithm 3, based on the delta between the predicted menstrual start day generated by the algorithm using temperature and the user-logged menstrual start day: ME (95% CI), MAE (95% CI), and percentage of absolute error within 3 days (PAE3).
Sensitivity and false retrospective ovulation estimate rate for Algorithms 1 and 2 were evaluated. Sensitivity, or the rate of ovulation estimates, was calculated as the percentage of cycles where retrospective ovulation day was estimated by the algorithm among cycles with a positive LH test result. The false-positive rate, or false estimate rate, of the retrospective ovulation estimate algorithms for ongoing and completed menstrual cycles was also evaluated, which denotes the percentage of cycles where ovulation was estimated by the algorithm among cycles without a positive LH test (i.e. anovulatory cycles). However, because we did not have any ground truth for anovulation in this study, we considered a simulation-based approach: white-noise trials without ground-truth biphasic shift signals were generated with the noise set at the level typically produced by the temperature hardware. We estimated the false-positive rate as the proportion of these trials in which the algorithm estimated an ovulation.
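The two rates above share one calculation: the percentage of cycles (or simulated trials) in which an estimate was returned. The Python sketch below illustrates this under loud assumptions. The detector is a toy stand-in invented for this example and is unrelated to the proprietary algorithms; the noise standard deviation, trial length, and all names are hypothetical, since the hardware noise level is not reported here.

```python
import random


def estimate_rate(estimates):
    """Percentage of cycles with an estimate (None means no estimate given)."""
    return 100.0 * sum(e is not None for e in estimates) / len(estimates)


def white_noise_trial(n_days, noise_sd, rng):
    """One simulated cycle of temperature deviations with no biphasic shift."""
    return [rng.gauss(0.0, noise_sd) for _ in range(n_days)]


def toy_detector(temps, shift=0.2, window=3):
    """Hypothetical stand-in detector, NOT the study's algorithm.

    Returns the first index at which the mean of the next `window` values
    exceeds the mean of all earlier values by at least `shift`, else None.
    """
    for i in range(window, len(temps) - window + 1):
        before = sum(temps[:i]) / i
        after = sum(temps[i:i + window]) / window
        if after - before >= shift:
            return i
    return None


def false_estimate_rate(n_trials=1000, n_days=28, noise_sd=0.1, seed=0):
    """Percentage of white-noise trials in which the toy detector fires."""
    rng = random.Random(seed)
    trials = [white_noise_trial(n_days, noise_sd, rng) for _ in range(n_trials)]
    return estimate_rate([toy_detector(t) for t in trials])
```

In this framing, sensitivity would be `estimate_rate` applied to the algorithm outputs for cycles with a positive LH test, and the false estimate rate would be the same function applied to outputs for the white-noise trials.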
Latency was also reported for ongoing cycles (Algorithm 1), which can be used to understand how soon the algorithm can make a retrospective ovulation day estimate during an ongoing cycle. The latency metric is defined as the delta between the cycle day where ovulation occurred and the cycle day when the algorithm outputs an estimate.
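A short Python sketch of the latency metric follows, assuming each cycle is summarized by a pair of cycle-day numbers (the labeled ovulation day and the day the estimate was output). The names and the pair layout are assumptions for illustration.

```python
from statistics import median


def latency_days(ovulation_cycle_day, output_cycle_day):
    """Days between the ovulation cycle day and the day the estimate was output."""
    return output_cycle_day - ovulation_cycle_day


def median_latency(pairs):
    """Median latency over (ovulation_cycle_day, output_cycle_day) pairs."""
    return median([latency_days(ov, out) for ov, out in pairs])
```

For example, an estimate output on cycle day 18 for an ovulation labeled on cycle day 14 has a latency of 4 days.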
The above metrics were reported for all cycles, as well as for the subset of cycles with wrist temperature rise of ≥0.2°C, as this is the threshold of temperature change previously reported to be associated with ovulation ( Vollman, 1977 ; Moghissi, 1980 ; Royston and Abrams, 1980 ; Coward and Wells, 2013 ; Shilaih et al. , 2018 ; Yu et al. , 2022 ). Additionally, since individuals with some atypical cycle lengths may present a more difficult scenario for retrospective ovulation day estimate, we evaluated retrospective ovulation day estimate and next menses start day predictions for participants with all typical or some atypical cycles. To compare the performance of wrist temperature to manually logged BBT data, where both temperature types were available in the same cycle, the same procedure was adopted to input the BBT data into the three algorithms, and the same metrics were calculated. The wrist temperature and BBT analyses were constrained to the same set of cycles and users for a fair comparison. To compare the accuracy of wrist temperature to the calendar method, we conducted three permutation tests on MAE. Similar to the bootstrap approach, the permutation test is used to handle non-Gaussian-distributed data with a within-sample correlation structure ( Henderson, 2005 ). First, we calculate the absolute errors of the wrist temperature and calendar methods for each cycle based on the ground truth. The observed statistic (obs_stat) is defined as the average of the cycle-level absolute error delta between the calendar and wrist temperature methods, i.e. obs_stat = mean of (|error from calendar| − |error from wrist temperature|) over all cycles. Second, we apply 5000 permutations. Under each permutation, we randomly switch the absolute errors between the calendar and wrist temperature methods and re-calculate the permutation-specific statistic (permute_stat) using the same formula. 
Last, the empirical P-value is calculated as the percentage of permutations in which | permute_stat | is larger than | obs_stat |. Similarly, three additional permutation tests were conducted to compare PAE2 for ongoing menstrual cycles (Algorithm 1) and completed menstrual cycles (Algorithm 2), and PAE3 for the menstrual cycle start (Algorithm 3), to the calendar method.
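The permutation procedure for MAE can be sketched in Python as follows. This is an illustrative reading of the description, not the study's code: randomly switching the two methods' absolute errors within a cycle is implemented as flipping the sign of that cycle's delta, the comparison uses "at least as large" rather than "larger" so that ties count toward the P-value (an assumed, slightly conservative choice), and the function name and seed are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_test(abs_err_calendar, abs_err_wrist,
                            n_permutations=5000, seed=0):
    """Return (obs_stat, empirical P-value) for paired absolute errors.

    obs_stat is the mean over cycles of
    |error from calendar| - |error from wrist temperature|.
    """
    rng = random.Random(seed)
    deltas = [c - w for c, w in zip(abs_err_calendar, abs_err_wrist)]
    obs_stat = mean(deltas)
    count = 0
    for _ in range(n_permutations):
        # Swapping the two methods' errors within a cycle negates its delta.
        permuted = [d if rng.random() < 0.5 else -d for d in deltas]
        permute_stat = mean(permuted)
        if abs(permute_stat) >= abs(obs_stat):
            count += 1
    return obs_stat, count / n_permutations
```

A positive `obs_stat` under this sign convention means the calendar method had the larger absolute error on average.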
Introduction
Tracking menstruation is the first step toward understanding reproductive health and fertility patterns ( Epstein et al. , 2017 ). The calendar method is widely used, freely available, and can provide estimates of fertile windows and period starts based on tracking of cycle length, though it performs best for those with high cycle regularity and more typical cycle lengths ( Stanford et al. , 2002 ; Jensen and Wrede, 2020 ). Ovulation can be detected using ultrasonography; however, this is difficult to access and utilize for individuals who are not undergoing fertility treatment; instead, patients and consumers are increasingly utilizing at-home methods to track ovulation and other cycle events. Those who menstruate can monitor the consistency of their cervical mucus for changes and identify an ongoing fertile window ( Pyper, 1997 ; Ecochard et al. , 2001 ; Najmabadi et al. , 2021 ). Cervical mucus monitoring can support couples with sub-fertility who are seeking conception but requires daily self-evaluation ( Graham et al. , 1983 ; Stanford et al. , 2022 ). Basal body temperature (BBT) tracking and identification of a luteal phase temperature elevation, typically a temperature change of 0.2°C in response to the increase in progesterone produced by the corpus luteum of the post-ovulatory follicle, can be used to confirm ovulation, though measurement can be inaccurate based on behavioral or environmental factors and requires repeated testing ( Matthews et al. , 1980 ; Barron and Fehring, 2005 ). Prediction of ovulation can be achieved with home, urine-based kits for LH surges, which detect the surge in LH that typically occurs 24–36 h prior to ovulation ( Fukunaga et al. , 1983 ; Singh et al. , 1984 ). Urine-based testing for a urine metabolite of progesterone provides confirmation of ovulation in the current, or ongoing, cycle ( Leiva et al. , 2019 ). For many individuals, tracking cycle events involves multiple modalities depending on their needs ( Pichon et al. , 2022 ).
Increasingly, it has been demonstrated that external sensors, such as wearable devices, can be used to detect the physiologic changes associated with the menstrual cycle and can complement tools for menstrual cycle tracking ( Fehring, 2005 ; Goodale et al. , 2019 ). These wearables remove the need for proactive self-testing as well as the user education required to execute the variety of testing types ( Howie, 1993 ; Hassoun, 2018 ). Various wearable devices measuring temperature have been used to estimate ovulation and may be less susceptible to temperature changes unrelated to ovulation brought on by sleep disruption, dietary changes, or physical activity that were concerns for traditional BBT ( Shilaih et al. , 2018 ; Hurst et al., 2022 ; Uchida and Izumizaki, 2022 ). In this study, we evaluate three algorithms available on compatible models of iPhone and Apple Watch which use wrist temperature from Apple Watch to retrospectively estimate the day of ovulation in ongoing and completed menstrual cycles and to predict the next menses start day. We also compare results based on wrist temperature with those based on BBT.