Design Patterns for the Development and Implementation of Bioacoustic Deep Learning Recognizers

preprint OA: closed

Abstract

Bioacoustic monitoring using autonomous recording units generates volumes of audio data that exceed the capacity of manual annotation, necessitating the use of automated species recognizers. Convolutional Neural Networks (CNNs) have emerged as the dominant approach due to their strong performance on spectrogram representations of audio data, yet their development and deployment remain challenging. This paper presents a set of reusable design patterns that address recurring methodological challenges in two key areas: 1) CNN recognizer development and 2) integration of recognizers into practical bioacoustics workflows. Drawing on a case study involving the development and implementation of a single-species CNN recognizer for the western toad in Banff National Park, Canada, we illustrate patterns related to data leakage, sampling bias, signal processing decisions, hyperparameter optimization, model training, and workflow integration. Each pattern is described in a structured problem/solution format and supported with real-world examples. Collectively, these patterns provide a comprehensive framework spanning recognizer design, deployment, and iterative improvement through user interfaces and active learning. By formalizing best practices, this work aims to improve the reliability, efficiency, and accessibility of CNN-based bioacoustic monitoring across a wide range of ecological applications.

Introduction

Bioacoustic surveys are increasingly used to collect data for a diverse array of sound-producing animal taxa and ecological monitoring applications (Brooker et al., 2020; Stowell, 2022). The use of autonomous recording units (ARUs) has gained popularity due to their ability to provide permanent survey records, minimize observer bias, reduce fieldwork time, and improve scalability (Knight et al., 2020; Shonfield & Bayne, 2017). However, ARUs extend sampling area and duration to the point that recordings accumulate far faster than human observers can annotate them (MacPhail et al., 2024). In response to this challenge, automated computer detection algorithms, referred to as ‘recognizers’, have been developed to identify the vocalizations of species (Knight et al., 2017). Among these, Convolutional Neural Networks (CNNs) have become the de facto standard due to their superior performance and their ability to bypass traditional pre-processing steps, such as noise reduction, signal detection, and feature extraction (Brown, 2024; Stowell, 2022). While CNN recognizers have the potential to revolutionize bioacoustic monitoring, they can be difficult for ecologists to develop due to the complexities of artificial neural networks (Brooker et al., 2020). The development of CNN recognizers relies upon advances in audio processing, bioinformatics, computer engineering, and mathematics (Brooker et al., 2020). Several sound analysis platforms have been developed to facilitate the use of recognizers. However, not all species or ecological applications are supported by these platforms (Brooker et al., 2020; Brown, 2024). Consequently, bioacoustic practitioners are often required to design their own recognizer (Brown, 2024). To help address this challenge, this paper presents design patterns for the development of CNN recognizers and their integration into bioacoustic workflows.
Design patterns have been used to address recurring challenges in methodological processes (Greenberg et al., 2019). In bioacoustics, challenges often emerge in two disparate areas: 1) the development of CNN recognizers, and 2) the integration of recognizers into bioacoustic workflows (Shonfield & Bayne, 2017; Stowell, 2022). The design patterns presented in this study provide structured templates and reusable solutions to commonly occurring problems within these areas. Each pattern is illustrated with a real-world example drawn from the development and implementation of a western toad (Anaxyrus boreas) CNN recognizer in Banff National Park, Canada. Collectively, these patterns represent a comprehensive framework that captures an entire bioacoustic recognizer program, from initial design to implementation.

General Methodology

Our methodology for identifying useful design patterns was informed by a western toad monitoring project in Banff National Park. The aim of this project was to estimate long-term changes in the spatial distribution of western toads. We developed a single-species CNN recognizer and streamlined workflows through a user interface. Many challenges encountered during this project, such as balancing automation with analyst oversight, are likely to arise in other bioacoustic projects. The design patterns presented here aim to provide generalizable solutions for a broad range of applications, including multi-species recognizers. The generalizability of design patterns stems from their focus on identifying and addressing underlying forces within a design problem (Marinescu, 2002). We present each design problem in Alexandrian form, starting with a description of the context and nature of the problem followed by the proposed solution (Alexander, 1977; Marinescu, 2002). For each design pattern, we included western toad examples to illustrate a practical application.
This structure is intended to equip bioacoustic practitioners with the knowledge and insight needed to adapt these solutions to future bioacoustics challenges.

CNN recognizer development

The high performance of CNN recognizers can be largely attributed to their ability to exploit grid structures in raw image data (Goodfellow et al., 2016; Stowell, 2022). CNN recognizers can process snapshots of audio data as spectrograms (Stowell, 2022). As spectrograms are a visual representation of an acoustic record, their use as input data negates the need to reduce the acoustic data into a small number of summary features (Brown, 2024). As a result, CNN recognizers can operate in much higher-dimensional spaces, enabling them to distinguish between subtle variations in similar acoustic signals (Stowell, 2022). The use of high-dimensional data in the development of CNN recognizers requires considerable computational power and time. For example, Brown (2024) noted that training a single CNN recognizer can take 3.5 hours on a powerful graphics processing unit. However, this upfront cost is offset by the rapid processing of new recordings once the recognizer is trained, particularly in comparison to recognizer models that rely on costly pre-processing, segmentation, and acoustic feature extraction (Brown, 2024). Given the resources required to develop a CNN recognizer, minimizing methodological errors and streamlining model development help prevent the unnecessary repetition of steps in the development process. While the conceptual framework for developing a CNN recognizer is straightforward (train a CNN to identify a species, tune the hyperparameters, and evaluate the final model), the process can be complicated by challenges such as limited data, limited computational resources, acoustic signal processing complexities, and a lack of familiarity with machine learning best practices.
The design patterns discussed in the following sections are intended to provide guidance and structure to the development of a CNN recognizer, facilitating the adoption of standardized and reliable practices.

Figure 1. Flowchart visualization of the developmental process used in the creation of the western toad CNN recognizer.

Data Leakage

Data leakage frequently causes errors in CNN recognizers, and its prevalence throughout machine learning applications has led to the declaration of a reproducibility crisis (Kapoor & Narayanan, 2022). Data leakage occurs when training, validation, or testing datasets contain overlapping data (Kapoor & Narayanan, 2022). Without properly separated data partitions, performance estimates become biased and overstate how well a CNN generalizes. In bioacoustic projects, data leakage can arise from a lack of familiarity with machine learning practices or an insufficient amount of labelled data for adequately sized datasets. Stowell (2022) notes that many studies use the same bioacoustics data during model training and testing, constituting a form of data leakage. A less obvious form of data leakage occurs when pre-processing parameters are identified prior to the data partitioning; this permits samples from the testing dataset to influence the development of the recognizer.

Design Pattern

The partitioning of labelled data into training, validation, and testing partitions is best informed by their fundamental purposes. Training data is used to teach the CNN to identify patterns in acoustic signals. The primary focus of a training dataset is to provide many classified samples of the vocalizations and the background acoustic environment; this provides an opportunity for the CNN to learn how to recognize patterns and variations within the acoustic signals and to differentiate signals from the acoustic environment.
The validation dataset then helps tune the hyperparameters by comparing predictions from different training regimes to the classified samples in the validation dataset. Because the validation dataset is used repeatedly to compare different hyperparameter configurations, a form of data leakage is introduced. To address this, K-fold cross-validation can improve the reliability of hyperparameter selection and reduce the risk of Type I error from data leakage (Stowell et al., 2019). Finally, the testing dataset is necessary to provide an unbiased estimate of the CNN recognizer’s real-world performance and account for the possibility that the CNN has been overfit to the validation dataset.

Example

We developed a western toad CNN recognizer using audio recordings collected between 2021 and 2023 with autonomous recording units (ARUs, Song Meter Mini; Wildlife Acoustics, Inc., Concord, MA, USA) deployed at over 20 ponds in Banff National Park. We deployed the ARUs in May and June, during the breeding season, and programmed them to record for 3 minutes every hour between 8 pm and 4 am. After retrieval, we manually transcribed the audio. Transcription involved listening for western toad calls and, when present, noting their occurrence at multiple instances per recording. ARUs also recorded abiotic sounds such as heavy wind, vehicles, and trains within the background environment. We split audio recordings into three-second segments, referred to as ‘clips’. We labelled clips with western toad vocalizations as positive and all other clips as negative. We first randomly partitioned audio recordings into a training pool (90%), used for both the training and validation datasets, and a testing dataset (10%). Next, we randomly grouped audio recordings from the training pool to create five validation folds that each contained approximately 20% of the total clips in the training pool. This approach mitigated the risk of data leakage from confounding variables.
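The recording-level partitioning described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `split_by_recording`, the `(recording_id, clip_id)` tuple layout, the toy data, and the fixed seed are all hypothetical, and stratification by label is omitted for brevity. The point it shows is that recordings, not clips, are the unit that gets shuffled and assigned, so clips from one recording never straddle two partitions.

```python
import random
from collections import defaultdict


def split_by_recording(clips, test_fraction=0.1, n_folds=5, seed=0):
    """Partition clips so that all clips from one recording stay together.

    clips: list of (recording_id, clip_id) tuples (hypothetical layout).
    Returns (folds, test_clips), where folds is a list of n_folds clip lists.
    """
    by_recording = defaultdict(list)
    for recording_id, clip_id in clips:
        by_recording[recording_id].append((recording_id, clip_id))

    recording_ids = sorted(by_recording)
    random.Random(seed).shuffle(recording_ids)

    # Hold out whole recordings for testing before anything else is tuned.
    n_test = max(1, round(test_fraction * len(recording_ids)))
    test_ids, pool_ids = recording_ids[:n_test], recording_ids[n_test:]

    # Deal the remaining recordings into validation folds round-robin.
    folds = [[] for _ in range(n_folds)]
    for i, recording_id in enumerate(pool_ids):
        folds[i % n_folds].extend(by_recording[recording_id])

    test_clips = [clip for rid in test_ids for clip in by_recording[rid]]
    return folds, test_clips


# Toy data: 50 recordings, each split into 60 three-second clips.
clips = [(f"rec{r:02d}", c) for r in range(50) for c in range(60)]
folds, test_clips = split_by_recording(clips)
```

Because assignment happens per recording, fold sizes measured in clips are only approximately equal when recordings contain different numbers of clips.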
Following these partitions, the training pool consisted of 644 positive and 1,332 negative clips, while the testing dataset included 50 positive and 135 negative clips.

Clip ratio bias

During training, a CNN recognizer’s ability to accurately detect a species can be influenced by the proportion of clips containing that species’ vocalizations within the training data (i.e., the clip ratio; Kapoor & Narayanan, 2022). CNNs often learn to recognize common species more effectively and can even default to predicting their presence. Analogously, in the case of a binary recognizer, a high proportion of negative clips may lead to higher specificity but comparatively worse recall (i.e., an increased risk of false negatives). This phenomenon can occur regardless of the total size of the training data. If a CNN recognizer’s performance varies among species, the clip ratio of the validation and testing datasets will influence the overall performance assessment (Kapoor & Narayanan, 2022). A skewed ratio in the validation dataset may lead to suboptimal hyperparameter selection. Similarly, imbalanced ratios in the testing dataset can distort the evaluation of a CNN recognizer’s performance and lead to unexpected outcomes when deployed in the field.

Design Pattern

During training, bioacoustic practitioners can control the effects of clip ratios by resampling or differentially weighting clips (Stowell, 2022). Resampling adjusts the proportion of clips to achieve more balanced clip ratios. The training dataset can be resampled with replacement by duplicating instances from underrepresented species. Alternatively, the effects of unequal clip ratios in the training dataset can be controlled by differentially weighting clips within the loss function that is used to train the model. Furthermore, differential weighting can be used to tailor the development of the CNN to project-specific characteristics.
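The two training-time controls described above, resampling with replacement and differential weighting in the loss function, can be sketched as follows. This is a minimal stdlib-only illustration rather than the study's implementation: the function names, the `(clip_id, label)` layout, the toy class sizes, and the weight of 0.8 are assumptions chosen for the example, and a deep learning framework would normally supply its own weighted loss.

```python
import math
import random


def resample_with_replacement(clips, n_per_class, seed=0):
    """Draw n_per_class clips from each label, duplicating clips if needed.

    clips: list of (clip_id, label) with label 1 (positive) or 0 (negative).
    """
    rng = random.Random(seed)
    balanced = []
    for label in (0, 1):
        members = [c for c in clips if c[1] == label]
        balanced.extend(rng.choices(members, k=n_per_class))
    rng.shuffle(balanced)
    return balanced


def weighted_bce(probability, label, positive_weight=1.0):
    """Binary cross-entropy with a multiplicative weight on positive clips."""
    eps = 1e-12  # guard against log(0)
    p = min(max(probability, eps), 1.0 - eps)
    if label == 1:
        return -positive_weight * math.log(p)
    return -math.log(1.0 - p)


# Toy training data: 30 positives and 70 negatives.
clips = [(i, 1) for i in range(30)] + [(i, 0) for i in range(30, 100)]
balanced = resample_with_replacement(clips, n_per_class=100)

# A missed positive is penalised less when positive_weight < 1.
loss_default = weighted_bce(0.2, 1)
loss_downweighted = weighted_bce(0.2, 1, positive_weight=0.8)
```

A positive weight below 1 makes errors on positive clips cheaper relative to errors on negative clips, which nudges the model away from false positives; a weight above 1 would do the opposite.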
Clip ratios must be carefully considered when evaluating CNN recognizer performance (both during validation and testing). Any resampling of the validation or testing datasets should be done without replacement to avoid duplication bias. For an unbiased estimate of a CNN recognizer’s performance, the testing dataset should reflect the real-world species distribution of the study area (Kapoor & Narayanan, 2022). For example, a species that is rarely encountered should have a proportionally small influence on performance metrics unless it is of specific interest to the researchers. However, because real-world distributions of species within a soundscape are often unknown, it is important to interpret evaluation results with caution. One strategy to mitigate the effects of species distribution during evaluation is macro-averaging, which involves calculating metrics separately for each species and then averaging the results (Mesaros et al., 2016). Macro-averaging provides a balanced, reproducible, and intuitive measure of CNN recognizer performance.

Example

We used the training pool to conduct five-fold stratified cross-validation of the CNN recognizer. Folds were created at the audio-recording level to account for correlations within recordings. Stratification ensured consistent distributions across folds, with each containing approximately 30% positive and 70% negative clips, drawn from a training pool of 644 positive and 1,332 negative clips. During training, the training folds were resampled with replacement to create a balanced training dataset of 1,000 positive and 1,000 negative clips. To reduce false positives, the weight of positive clips in both the training and validation loss functions was scaled by a factor of 0.8. Five-fold cross-validation was conducted by sequentially withholding one fold for validation while parameterizing the model on the other four.
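The macro-averaging strategy mentioned in the design pattern above can be illustrated with a small, self-contained calculation. The class names, the toy label counts, and the choice of recall as the metric are assumptions made for the example; the sketch simply contrasts a pooled metric with the unweighted mean of per-class metrics.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall computed separately for every class present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def macro_average(per_class_scores):
    """Unweighted mean of the per-class scores."""
    return sum(per_class_scores.values()) / len(per_class_scores)


# Toy evaluation set: 90 clips of a common class, 10 of a rare class.
y_true = ["common"] * 90 + ["rare"] * 10
# A degenerate recognizer that always predicts the common class.
y_pred = ["common"] * 100

recalls = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall = macro_average(recalls)
```

In this toy case the pooled accuracy is 0.9 even though the rare class is never detected, whereas the macro-averaged recall is 0.5, which makes the failure on the rare class visible regardless of the clip ratio in the evaluation set.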
Calculation of a STFT window length

CNNs are optimized to analyze visual data; it is therefore necessary to convert acoustic clips into spectrogram images. This conversion is generally accomplished through a discrete short-time Fourier transform (STFT; Stowell, 2022; Knight et al., 2020). The main parameters in this transform are the length, type, and overlap of the STFT window. A Hanning window with 50% overlap is commonly used in spectrogram generation to ensure full coverage of the audio recording and to reduce spectral leakage, which occurs when energy from one frequency spreads into adjacent frequency bins (Cerna & Harvey, n.d.; Trethewey, 2000). However, the optimal window length depends on the characteristics of the acoustic signals, which in our case were amphibian calls (Knight et al., 2020). Longer windows provide better frequency resolution but reduce temporal resolution, forcing bioacoustic practitioners to find a balance appropriate for their species of interest (Stowell, 2022). Additionally, window lengths that are powers of two are often preferred because they enable the use of fast Fourier transform (FFT) algorithms, which improve computational efficiency (Knight et al., 2020). Optimal window length is often selected through hyperparameter optimization, testing a range of values during model training and validation. However, the many possible parameter values can quickly overwhelm the available computational resources. Practitioners are therefore advised to propose initial window lengths based on characteristics of the species, such as typical call duration and the need to resolve fine-scale frequency features.

Design Pattern

Calculating a window length in advance can significantly reduce the scope of hyperparameter optimization. If either the time or frequency dimension of the spectrogram is more important for species detection, the problem can be approached as a constrained optimization.
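To make the time-frequency trade-off concrete, the sketch below computes the duration and frequency-bin spacing of an STFT window and picks the longest power-of-two window that still fits within a given shortest event. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the 16 kHz sampling rate and 0.02 s event duration are illustrative inputs rather than recommendations.

```python
import math


def largest_power_of_two_at_most(n):
    """Largest power of two that does not exceed n (n >= 1)."""
    power = 1
    while power * 2 <= n:
        power *= 2
    return power


def stft_resolution(window_length, sample_rate):
    """Return (window duration in seconds, frequency bin spacing in Hz)."""
    return window_length / sample_rate, sample_rate / window_length


def frequency_priority_window(shortest_event_s, sample_rate):
    """Longest power-of-two window no longer than the shortest event."""
    # Small epsilon protects against floating-point error in the product.
    max_samples = math.floor(shortest_event_s * sample_rate + 1e-9)
    return largest_power_of_two_at_most(max_samples)


sample_rate = 16_000      # Hz (illustrative)
shortest_event_s = 0.02   # seconds (illustrative)
window = frequency_priority_window(shortest_event_s, sample_rate)
duration_s, bin_spacing_hz = stft_resolution(window, sample_rate)
```

With these inputs the event spans 320 samples, so the selected window is 256 samples long, covering 0.016 s with a bin spacing of 62.5 Hz. Halving the window would halve the duration it covers but double the bin spacing, which is the trade-off the constraint is meant to manage.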
The first step is to prioritize the dimension with greater discriminative power for the species. Then, a minimum acceptable resolution in the secondary dimension can be estimated, which acts as an inequality constraint. An additional equality constraint, requiring that the window length is a power of two, ensures compatibility with FFT algorithms. The optimal window length can then be selected as the highest (frequency-priority) or lowest (time-priority) value that satisfies both constraints. If a minimum acceptable resolution in the priority dimension is known, care should be taken to ensure that it is achieved; if it is not, the inequality constraint on the secondary dimension must be reformulated as a soft constraint with a graduated penalty.

Example

We used a STFT to create spectrogram images from audio clips (Figure 2). We specified all STFT parameters directly without including any in the hyperparameter optimization. We implemented the generally accepted combination of a Hanning window with 50% overlap (Trethewey, 2000). To calculate an optimal window length, we first placed priority on the frequency dimension. This decision was based on the presence of other amphibian species in the study area that produced similar temporal call patterns. We hypothesized that the differences in frequency structure would provide the most reliable basis for species differentiation. To determine a minimum acceptable resolution in the time dimension (i.e., a maximum acceptable window size), we estimated the shortest duration of a pulse from a western toad vocalization. Through visual observations of spectrograms, we found this to be approximately 0.02 seconds. Given the audio sampling rate of 16 kHz, this duration corresponded to 320 audio samples. We rounded this value down to the nearest power of two, 256 samples, to establish an upper bound for the window length parameter of the FFT.

Figure 2. Spectrogram of a western toad vocalization generated using a short-time Fourier transform with a 128-sample Hanning window and 50% overlap at a 16 kHz sampling rate. The spectrogram was band-limited below 4000 Hz and scaled in decibels. Pixel intensity represents relative signal energy.

Hyperparameter optimization resource requirements

Hyperparameter optimization (HPO) is a key component of machine learning and can significantly improve the performance of deep learning models (Hutter et al., 2019). Optimizing all hyperparameter values from first principles is rarely possible due to the complexity of the configuration space, a lack of information regarding the influence of the hyperparameters on the model, and conditionality between the hyperparameters (Hutter et al., 2019; Stowell, 2022). As a result, automated HPO algorithms are commonly used, such as Bayesian optimization, surrogate models, and genetic algorithms (Hutter et al., 2019). However, these HPO algorithms can be computationally expensive. Given the high cost of HPO, it is important to minimize unnecessary computation within each training instance. The duration of a CNN training instance can be measured in epochs, defined as a complete pass of the training data through the CNN. Epochs influence CNN performance in nonlinear ways, often characterized by an initial improvement followed by a plateau or decline as overfitting occurs. Early stopping procedures can detect this plateau and halt training to conserve resources without compromising performance (Goodfellow et al., 2016).

Design Pattern

Early stopping should be applied within each training instance of an HPO procedure. Its purpose is to detect the inflection point at which the CNN stops improving on unseen data and begins to overfit. This point can be estimated using predictions on a validation dataset.
After each training epoch, a performance metric can be computed by comparing the CNN’s predictions to the true labels in the validation dataset. The change in this metric over successive epochs can then be assessed relative to a pre-determined threshold. If the improvement falls below this threshold, training is halted.

Example

We conducted a grid search across five hyperparameters. The default configuration consisted of a Residual Network (ResNet) architecture trained using the AdamW optimizer. AdamW is beneficial because it integrates regularization efficiently and, through Adaptive Moment Estimation (Adam), reduces the need for a learning rate scheduler, thereby simplifying the grid search. No-regret data augmentations (i.e., those that do not alter the semantic content of bioacoustic signals) were randomly applied to the training data by default, as these have been shown to improve recognizer performance. Additional experimental augmentations were included in the hyperparameter grid. Other hyperparameters included three different learning rates and two ResNet variants, yielding a total of 24 unique configurations (Table 1). For each configuration, we implemented five-fold cross-validation. Training was terminated either when early stopping criteria were met or after 100 epochs, whichever occurred first. We evaluated model performance using a sigmoid-transformed binary cross-entropy loss function. The early stopping criterion was defined as no improvement in validation loss across 16 consecutive epochs.

| Hyperparameter | Values |
| Data Augmentation: Rescale Spectrogram | True; False |
| Data Augmentation: Add Noise to Spectrogram | True; False |
| ResNet Layers | 34; 50 |
| Learning Rate | 1e-3; 1e-4; 1e-5 |

Table 1. Hyperparameters and values included in HPO.

Training the final model

After identifying the optimal hyperparameters through cross-validated grid search with early stopping, additional challenges can arise when training the final model.
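Because the behaviour of early stopping is what makes the final training duration ambiguous, it helps to see the rule in isolation. The sketch below is a minimal, framework-free version of a patience-based stopping rule with an epoch cap; the function name, the `min_delta` parameter, and the synthetic loss curve are assumptions made for illustration, and in a real training loop each loss value would come from evaluating the model on the validation dataset.

```python
def early_stopping_epoch(validation_losses, patience=16, max_epochs=100,
                         min_delta=0.0):
    """Return (epochs_run, best_epoch, best_loss) for a stream of losses.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs, or at max_epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_epoch, best_loss


# Synthetic curve: loss falls for 20 epochs, then creeps upward (overfitting).
falling = [1.0 - 0.04 * e for e in range(1, 21)]
rising = [0.2 + 0.01 * e for e in range(1, 81)]
epochs_run, best_epoch, best_loss = early_stopping_epoch(falling + rising)
```

On this synthetic curve the best loss occurs at epoch 20 and training halts at epoch 36, after 16 epochs without improvement. A different fold with a differently shaped curve would halt at a different epoch, which is why the stopping epochs from cross-validation do not directly determine how long to train a final model.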
In line with neural scaling laws, CNN performance is expected to improve with increased training data (Hestness et al., 2017). It is therefore advantageous to train the final model with the entire training pool, combining both training and validation clips used during cross-validation. However, each fold in the HPO process may have stopped training after a different number of epochs due to early stopping, making it unclear how many epochs to use for the final model. Moreover, applying early stopping would require withholding a subset of labelled data for validation, which contradicts the goal of maximizing training data. As a result, selecting an appropriate training duration for the final model requires an alternative strategy.

Regularization helps establish a training regime in which model performance plateaus rather than deteriorates. When no validation dataset is available, performance must be derived solely from the training dataset, which often leads to inflated estimates due to overfitting. Various techniques can mitigate overfitting. These include penalizing large parameter values (e.g., L1 and L2 regularization) and introducing random variation (e.g., data augmentation and dropout layers). The optimal regularization strength can be identified through cross-validation, where the optimal value is identified as the lowest magnitude that prevents overfitting without impeding learning.

Example

After completing the initial HPO, we selected the configuration with the lowest average validation loss. These hyperparameter values were then used as the default in a subsequent hyperparameter grid search focused on regularization. In this second search, we inserted a dropout layer before the final fully connected layer in the ResNet and tested four dropout percentages as a new hyperparameter. For each dropout percentage, we performed five-fold stratified grouped cross-validation, training each fold for 80 epochs.
To assess the stability of the training regime and detect overfitting, we fit a linear regression to the validation loss from epochs 20 to 80, excluding earlier epochs when the CNN was likely still in the rapid learning phase. We then averaged the slopes and y-intercepts across folds for each dropout percentage. A higher slope was interpreted as an indicator of overfitting. We applied a threshold of 0.1 to the slope and selected the lowest dropout percentage that met this criterion. We then used this dropout percentage to train the final CNN model on the full training pool. Finally, we trained the model for 60 epochs and then evaluated it on the withheld testing dataset to assess its real-world performance.

Integration of recognizer into workflows

Many bioacoustic recognizers are implemented as a series of Python or R scripts (Stowell, 2022). While these scripts support reproducibility, they may limit the accessibility and utility of the CNN recognizers for a broader user base (Stowell, 2022). The output of a CNN recognizer typically consists of scores that approximate the probability of species presence within sequential segments of a recording (hereafter ‘predictions’). Regardless of the accuracy or reliability of a CNN recognizer, many bioacoustic workflows primarily use these predictions to subset and extract candidate audio clips for manual review. The need for analysts to programmatically identify, extract, and review clips, sometimes even generating audio playback and spectrograms, can reduce the efficiency and consistency of bioacoustic workflows. User interfaces (UIs) have been recognized as a key component for effectively integrating recognizers into bioacoustic workflows (Stowell, 2022). A well-designed UI can serve as an intermediary between the analyst and the recognizer, facilitating interaction and improving accessibility for non-technical users.
At a minimum, a bioacoustic UI must host the recognizer and allow users to generate predictions from selected audio recordings. Beyond these basic functional requirements, there is no clear consensus on what constitutes the optimal UI design for bioacoustic applications (Stowell, 2022). The complexity of working with large hierarchical datasets, integrating multi-modal displays, and supporting flexible workflows presents unique design challenges. The following section explores UI features that can improve both the efficiency and accuracy of bioacoustic workflows. To inform this discussion, we developed design patterns during the creation of a UI for the western toad project. These patterns are intended to be flexible and can be adapted for other bioacoustic projects. By sharing them, we aim to offer practical insights for developing customizable, user-friendly interfaces across a wide range of bioacoustic projects.

Navigating hierarchical datasets

Bioacoustic projects often involve large datasets organized across multiple hierarchical scales. CNN recognizers typically process these datasets by segmenting full audio recordings, creating at least two levels: complete recordings and their constituent clips. Some projects further group recordings by spatial and temporal attributes, introducing additional layers of organization. A well-designed UI must facilitate navigation both across and within these hierarchical levels. It should also support playback and visualization of clips, while presenting relevant metadata for any selected item. These capabilities enable users to efficiently explore complex datasets and interpret CNN predictions in context.

Design Pattern

The UI must provide effective tools for navigating both audio recordings and their associated clips. This is typically achieved through navigators, UI elements that allow analysts to select from a list of available options.
While various types of navigators exist, a straightforward and intuitive approach involves a recording navigator, presented as an interactive table that lists available audio recordings. Once a recording is selected, a corresponding clip navigator can display the clips within that recording. This structure supports the integration of recognizer predictions. At the clip level, prediction scores can be shown directly within the clip navigator, either as numerical values or with colour coding to support rapid visual interpretation. Since the recognizer processes audio at the clip level, raw predictions cannot be directly associated with the recording listings. However, a summary metric, such as the number of confirmed detections per recording, can be included in the recording navigator to provide quick insights. For larger-scale projects, the UI should support multiple recordings as input. In such cases, the recording navigator can be populated with user-defined sets of recordings, organized by project-specific attributes such as location, date, or time of recording. This hierarchical structure enhances the efficiency of managing and analyzing complex acoustic datasets.

Example

We developed a UI to facilitate and enhance analyst interaction with the western toad recognizer at multiple project scales (Figure 3). Analysts begin by specifying a directory containing audio recordings, which are then processed by the recognizer. The resulting predictions, along with the associated recordings, are used to populate the UI. This design allows analysts to define upper-level project structures based on the organization of their folder hierarchy. At finer project scales, the UI displays both a recording navigator and a clip navigator. The recording navigator is implemented as an interactive table, where each row represents a unique recording. Selecting a row loads the corresponding clips into the clip navigator, which is structured as an ordered grid of sequential clips.
Selecting a clip triggers audio playback and displays its spectrogram representation. In addition to navigation, the UI integrates recognizer predictions and manual verification across project scales. Within the clip navigator, predictions are visualized using a colour gradient that fills the background of each grid cell, enabling quick identification of prediction distributions or high-scoring clips. At the recording level, the recording navigator includes a dedicated column that flags the detection of a western toad in any of its clips. An export feature allows analysts to create a CSV summary of all detected western toad vocalizations across the dataset, supporting broader data aggregation and analysis.

Figure 3. UI design elements that enable navigation between and within hierarchical bioacoustic datasets.

Variable effects of threshold values

When used alongside a threshold score, the predictions generated by a recognizer can support automatic classification. This approach can be used to identify clips that do not contain the species, those that do, or both (Stowell, 2022). However, the choice of threshold value can significantly impact the outcome of any bioacoustic auto-classification task (Knight & Bayne, 2019). Low thresholds have been shown to produce more false positives, while high thresholds increase the probability of missing the occurrence of a species (Knight & Bayne, 2019). This trade-off makes threshold selection a critical consideration in bioacoustic recognizer workflows. The problem is further complicated by how recognizers assign prediction scores. The scores may not accurately reflect true probabilities, as recognizers can produce under- or over-confident predictions. As a result, prediction scores may not be directly comparable across different species (Stowell, 2022).
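The trade-off described above can be seen by counting errors at several thresholds on a small set of verified clips. The scores and labels below are hypothetical values invented for illustration, and the function name is an assumption; the sketch only shows how false positives and false negatives move in opposite directions as the threshold rises.

```python
def confusion_at_threshold(scores, labels, threshold):
    """Count false positives and false negatives for one score threshold."""
    false_positives = sum(
        1 for s, y in zip(scores, labels) if s >= threshold and y == 0
    )
    false_negatives = sum(
        1 for s, y in zip(scores, labels) if s < threshold and y == 1
    )
    return false_positives, false_negatives


# Hypothetical prediction scores and verified labels (1 = species present).
scores = [0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.90, 0.95, 0.99]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# Map each threshold to (false positives, false negatives).
sweep = {t: confusion_at_threshold(scores, labels, t) for t in (0.2, 0.5, 0.85)}
```

For this toy data the threshold of 0.2 yields three false positives and no false negatives, 0.5 yields two and one, and 0.85 yields none and two, so no single value removes both kinds of error.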
Design Pattern

Given the trade-off between type I and type II errors, as well as the variable relationship between prediction scores and true probability, any UI that supports automatic classification must allow analysts to manually adjust threshold values and inspect classifications across different thresholds. To enhance this functionality, the UI can support both a lower and an upper threshold. Under this dual-threshold method, clips with prediction scores above the upper threshold are automatically classified as detections, while those below the lower threshold are considered absences. This approach helps mitigate the trade-off between false positives and false negatives that typically arises when using a single threshold. However, clips with prediction scores falling between the two threshold values remain unclassified. For some applications, these unclassified clips may be negligible; in others, they can be flagged for manual review and classification.

Example

We implemented user-defined lower and upper thresholds for automated classification (figure 3). Analysts interact with numerical inputs to set these bounds, which are then used to automatically classify all clips with prediction scores outside the specified range. Classification results are displayed in the clip navigator. At the recording level, a detection indicator is also displayed in one of the recording navigator’s columns.

Figure 4. Features to support manual and automated classifications.

Supporting manual and automated classifications

Many bioacoustic projects require the option to manually classify clips. Depending on the accuracy of recognizers, relying solely on them can compromise data quality. A well-designed UI should support efficient manual review and provide a way to reconcile differences between manual and automated classifications.

Design Pattern

Manual classification is most effective when analysts can review both acoustic and visual information together.
Integrating spectrogram visualization with audio playback enables thorough inspection, while intuitive interface elements, such as buttons or dropdowns, reduce the likelihood of user error. Once a manual classification has been made, the UI needs to support its integration with automated classifications. The UI should record the source of each classification (e.g., recognizer or analyst). Manual classifications should take priority in cases of disagreement to reflect their higher reliability. At the recording level, summaries should follow clear rules: a species can be considered detected once any clip is marked as a detection but should not be marked absent until all clips have been reviewed without detection. Summaries or lists of detections can then be displayed for each recording.

Example

In the western toad project, we implemented features to support manual classification (figure 3). When a clip is selected, the corresponding audio playback and spectrogram are displayed, and analysts can record detections using colour-coded buttons. By default, clips are either tagged with a system-generated transcriber label if they have been auto-classified or left blank if unclassified. When a clip is manually classified, this label is replaced with the analyst’s username, represented in the clip navigator with distinct symbols. In the recording navigator, transcriber information and detection status are summarized in dedicated fields. If a recording contains no detections but still has unclassified clips, both fields remain blank, signalling that review is incomplete. This design provides a transparent record of manual and automated classifications, supports efficient workflow, and helps maintain data quality.

Efficiency of manual classification

Manual classification tasks often require analysts to navigate large volumes of audio, with substantial time spent moving between recordings and individual clips.
In bioacoustic projects, this can involve sifting through considerable noise for each detection. The repetitive nature of this task can reduce manual classification accuracy and increase analyst fatigue.

Design Pattern

To improve efficiency, recognizer predictions can guide the order in which clips are presented for manual classification. Clips can be sorted by prediction score, so that clips most likely to contain detections are prioritized. Partially automating clip classification can further streamline the workflow by including only unclassified clips for manual review. Optional automatic transitioning between recordings can also be implemented, but should remain under analyst control to avoid disrupting navigation within fully reviewed recordings. These principles help analysts to process large volumes of audio more efficiently.

Example

We designed the western toad bioacoustic UI to streamline analyst workflows (figure 4). Clips are sorted by prediction score, and the highest-ranking unclassified clip is presented first. Once a clip is manually classified, the next highest-ranking unclassified clip is automatically displayed. We also included an optional feature that automatically advances to the next unclassified recording once the current recording is classified. This process continues until all recordings are classified or the analyst chooses to manually intervene.

Figure 5. Automatic selection of unclassified clip with highest prediction score.

Complexity of multi-species classification

As the number of species under consideration increases, classification tasks become more complex. This complexity arises because analysts must shift attention between species, rather than focusing on a single species at a time. As task complexity grows, so does the risk of human error, which can reduce the reliability of results.

Design Pattern

To mitigate the challenges of multi-species classification, workflows can be partitioned by species.
A species selection feature allows analysts to focus on one species at a time, reducing the need to switch attention. Once a species is selected, the UI can reconfigure the recognizer predictions into a binary format, enabling a single-species review process. Although classifying multiple species in parallel may be faster, a serialized approach simplifies the task, improving reliability. This design also accommodates analysts who are only qualified to classify a subset of species.

Example

Although the UI developed for this project was focused on western toads, we additionally developed a recognizer for wood frogs in a related project. To manage both species in a single UI, we incorporated a toggle switch that allowed analysts to shift between single-species binary workflows (figure 5). Selecting a species adjusts the display and functionality accordingly, while all other design patterns remain unchanged. This approach maintains a consistent workflow and minimizes the potential for error when working with multiple species.

Figure 6. Serialize multi-species classification tasks into binary workflows.
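The classification patterns in this section (dual thresholds, priority of manual over automated classifications, recording-level summaries, and score-ordered review) can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation archived with this paper: the `Clip` class, the `AUTO` label, the function names, the scores, and the username are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

AUTO = "auto"  # hypothetical transcriber label for recognizer-made classifications


@dataclass
class Clip:
    recording: str
    index: int
    score: float                     # recognizer prediction score
    detected: Optional[bool] = None  # None means unclassified
    source: Optional[str] = None     # AUTO or an analyst's username


def auto_classify(clips, lower, upper):
    """Dual-threshold rule: above upper = detection, below lower = absence."""
    for clip in clips:
        if clip.source not in (None, AUTO):
            continue  # manual classifications take priority and are never overwritten
        if clip.score > upper:
            clip.detected, clip.source = True, AUTO
        elif clip.score < lower:
            clip.detected, clip.source = False, AUTO
        else:
            clip.detected, clip.source = None, None  # left for manual review


def classify_manually(clip, detected, analyst):
    """Record an analyst's decision and its source."""
    clip.detected, clip.source = detected, analyst


def recording_status(clips):
    """True once any clip is a detection; False only if every clip is an absence."""
    if any(c.detected is True for c in clips):
        return True
    if clips and all(c.detected is False for c in clips):
        return False
    return None  # review incomplete


def next_for_review(clips):
    """Return the highest-scoring unclassified clip, or None if none remain."""
    pending = [c for c in clips if c.detected is None]
    return max(pending, key=lambda c: c.score, default=None)


# Invented scores for the clips of one recording.
clips = [Clip("rec1.wav", i, s) for i, s in enumerate([0.02, 0.45, 0.60, 0.10])]
auto_classify(clips, lower=0.2, upper=0.8)
status_after_auto = recording_status(clips)   # None: two clips remain unclassified

first = next_for_review(clips)                # the clip scoring 0.60
classify_manually(first, False, "analyst_a")  # "analyst_a" is a made-up username
second = next_for_review(clips)               # the clip scoring 0.45
classify_manually(second, True, "analyst_a")

auto_classify(clips, lower=0.5, upper=0.55)   # new thresholds leave manual labels intact
print(status_after_auto, recording_status(clips), second.detected)
```

Keeping the classification source on each clip is what allows thresholds to be adjusted repeatedly: automated labels are recomputed on every pass, while analyst labels persist.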

Discussion

Bioacoustic signal detection and classification projects benefit greatly from advances in machine learning. Automated recognizers can rapidly and accurately process large volumes of acoustic recordings, enabling expanded survey efforts at lower costs. While bioacoustic analysis for many species is supported through public software platforms or repositories, others require the development and integration of bespoke CNN recognizers. Developing state-of-the-art CNN recognizers requires careful adherence to best practices in signal processing and machine learning. The design patterns discussed in section 3 provide both conceptual and practical guidance across this process. Once developed, the full benefits of a CNN recognizer depend on its integration into a well-designed UI. Although the optimal features of a bioacoustic UI depend on a study’s goals, the patterns outlined in section 4 highlight key principles for building effective UIs. Together, these patterns form a framework for an entire bioacoustic recognizer program, from initial design to implementation.

Importantly, the integration of recognizers into a UI creates opportunities for active learning, where model development and implementation become part of the same iterative process. Recognizers are first trained using available data, then deployed through the UI, where analysts review predictions, flag errors, and efficiently classify new data using predictions. The additional transcribed data can then be used to retrain models. This process can be repeated, creating a continuous feedback loop that steadily improves performance. Through this process, the UI facilitates both efficient classification and model refinement.

Automated recognizers can enhance a wide range of bioacoustic applications. They can be applied to any species that produces a distinct acoustic signal, including birds, cetaceans, terrestrial and aquatic mammals, anurans, insects, and fish (Brown, 2024; Stowell, 2022).
Bioacoustic data have been used to measure biodiversity, estimate species densities, detect cryptic species, and model occupancy, distribution, and migratory movements (Brown, 2024; K. V. S. N. et al., 2020). In BNP, bioacoustic surveys are used to model western toad occupancy. The integration of an automated recognizer has enabled a substantial expansion of survey effort and improved the precision of occupancy models. Beyond enhancing ongoing bioacoustic programs, the reduced cost of automated surveys makes large-scale efforts feasible even in resource-limited settings.

Machine learning offers opportunities beyond bioacoustics. Computer vision tools have been developed to automate camera trap image classification and quantify vegetation abundance and composition in small-scale quadrats (Beery et al., 2019; McCool et al., 2018). As these tools evolve, they are likely to drive further innovation across ecological research. Although this paper focuses on bioacoustics, many of the design patterns discussed here can be adapted to other ecological disciplines. Thoughtful development and implementation of machine learning tools holds strong potential for enhancing ecological programs and improving our understanding and management of ecosystems.

Author Contributions

Gavin Hurd: Data curation-Supporting, Formal analysis-Lead, Writing - original draft-Lead, Writing - review & editing-Equal
Robin Baron: Conceptualization-Equal, Data curation-Lead, Writing - review & editing-Equal
Jesse Whittington: Conceptualization-Equal, Supervision-Lead, Writing - review & editing-Equal

Acknowledgements

We thank Parks Canada Resource Conservation staff in the Ecological Integrity functions for collecting and classifying hundreds of hours of audio recordings.

Conflict of Interest Statement

The authors declare no conflicts of interest.

Data Availability Statement

All data and code supporting the designs and workflows described in this study are publicly accessible.
The convolutional neural network recognizer training scripts are archived on Zenodo and are available at: https://doi.org/10.5281/zenodo.17993137
The source code for the user interface is archived on Zenodo and is available at: https://doi.org/10.5281/zenodo.17993133
A subset of the raw audio recordings used in this study is archived on Zenodo to facilitate replication and review and is available at: https://doi.org/10.5281/zenodo.17993187

References

Alexander, C. (1977). A Pattern Language. Oxford University Press.
Beery, S., Morris, D., & Yang, S. (2019). Efficient Pipeline for Camera Trap Image Review (No. arXiv:1907.06772). arXiv. https://doi.org/10.48550/arXiv.1907.06772
Brooker, S. A., Stephens, P. A., Whittingham, M. J., & Willis, S. G. (2020). Automated detection and classification of birdsong: An ensemble approach. Ecological Indicators, 117, 106609. https://doi.org/10.1016/j.ecolind.2020.106609
Brown, A. (2024). Automatic processing of large-scale bioacoustic data using dynamic workflows. https://doi.org/10.25959/23246969.V2
Cerna, M., & Harvey, A. F. (n.d.). The Fundamentals of FFT-Based Signal Analysis and Measurement.
COSEWIC. (2012). COSEWIC assessment and status report on the Western Toad Anaxyrus boreas in Canada. Committee on the Status of Endangered Wildlife in Canada. Ottawa. xiv + 71 pp.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org
Greenberg, S., Godin, T., & Whittington, J. (2019). Design patterns for wildlife-related camera trap image analysis. Ecology and Evolution, 9(24), 13706–13730. https://doi.org/10.1002/ece3.5767
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically (No. arXiv:1712.00409). arXiv. https://doi.org/10.48550/arXiv.1712.00409
Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.). (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer International Publishing. https://doi.org/10.1007/978-3-030-05318-5
K. V. S. N., R. R., Montgomery, J., Garg, S., & Charleston, M. (2020). Bioacoustics Data Analysis – A Taxonomy, Survey and Open Challenges. IEEE Access, 8, 57684–57708. https://doi.org/10.1109/ACCESS.2020.2978547
Kapoor, S., & Narayanan, A. (2022). Leakage and the Reproducibility Crisis in ML-based Science (No. arXiv:2207.07048). arXiv. http://arxiv.org/abs/2207.07048
Knight, E. C., & Bayne, E. M. (2019). Classification threshold and training data affect the quality and utility of focal species data processed with automated audio-recognition software. Bioacoustics, 28(6), 539–554. https://doi.org/10.1080/09524622.2018.1503971
Knight, E. C., Hannah, K. C., Foley, G. J., Scott, C. D., Brigham, R. M., & Bayne, E. (2017). Recommendations for acoustic recognizer performance assessment with application to five common automated signal recognition programs. Avian Conservation and Ecology, 12(2), art14. https://doi.org/10.5751/ACE-01114-120214
Knight, E. C., Poo Hernandez, S., Bayne, E. M., Bulitko, V., & Tucker, B. V. (2020). Pre-processing spectrogram parameters improve the accuracy of bioacoustic classification using convolutional neural networks. Bioacoustics, 29(3), 337–355. https://doi.org/10.1080/09524622.2019.1606734
MacPhail, A. G., Yip, D. A., Knight, E. C., Hedley, R., Knaggs, M., Shonfield, J., Upham-Mills, E., & Bayne, E. M. (2024). Audio data compression affects acoustic indices and reduces detections of birds by human listening and automated recognisers. Bioacoustics, 33(1), 74–90. https://doi.org/10.1080/09524622.2023.2290718
Marinescu, F. (2002). EJB design patterns: Advanced patterns, processes, and idioms. Wiley.
McCool, C., Beattie, J., Milford, M., Bakker, J. D., Moore, J. L., & Firn, J. (2018). Automating analysis of vegetation with computer vision: Cover estimates and classification. Ecology and Evolution, 8(12), 6005–6015. https://doi.org/10.1002/ece3.4135
Mesaros, A., Heittola, T., & Virtanen, T. (2016). Metrics for Polyphonic Sound Event Detection. Applied Sciences, 6(6), 162. https://doi.org/10.3390/app6060162
Shonfield, J., & Bayne, E. M. (2017). Autonomous recording units in avian ecological research: Current use and future applications. Avian Conservation and Ecology, 12(1), art14. https://doi.org/10.5751/ACE-00974-120114
Stowell, D. (2022). Computational bioacoustics with deep learning: A review and roadmap. PeerJ, 10, e13152. https://doi.org/10.7717/peerj.13152
Stowell, D., Wood, M. D., Pamuła, H., Stylianou, Y., & Glotin, H. (2019). Automatic acoustic detection of birds through deep learning: The first Bird Audio Detection challenge. Methods in Ecology and Evolution, 10(3), 368–380. https://doi.org/10.1111/2041-210X.13103
Trethewey, M. W. (2000). Window and overlap processing effects on power estimates from spectra. Mechanical Systems and Signal Processing, 14(2), 267–278. https://doi.org/10.1006/mssp.1999.1274

This work is licensed under a Non Exclusive No Reuse License.

Gavin Hurd, Robin Baron, Jesse Whittington. Design Patterns for the Development and Implementation of Bioacoustic Deep Learning Recognizers. Authorea. 06 January 2026. DOI: https://doi.org/10.22541/au.176770507.72237272/v1
