Perceptual Autoencoder and Exemplar Selection for Lifelong Learning in Convolutional Neural Networks (CNNs)

doi:10.21203/rs.3.rs-4146505/v1

2024 · doi:10.21203/rs.3.rs-4146505/v1

Research Article

Hermawan Nugroho, Gee Yang Tay, Swaraj Dube

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4146505/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

Lifelong learning or incremental learning in convolutional neural networks (CNNs) has encountered a challenge known as catastrophic forgetting, which impairs model performance when tasks are presented sequentially. While a simple approach of retraining the model with all previously seen training data can alleviate this issue to some extent, it is not scalable due to the rapid accumulation of storage requirements and retraining time.
To address this challenge, we propose a novel incremental learning strategy involving image data generation and exemplar selection. Specifically, we introduce a new type of autoencoder called the Perceptual Autoencoder, which reconstructs previously seen data while significantly compressing it, requiring no retraining when new classes are introduced. The latent feature map from the undercomplete Perceptual Autoencoder is stored and utilized to reconstruct training data for replay alongside new class data when necessary. Additionally, we employ example forgetting as an exemplar detection metric for exemplar selection, aiming to minimize the number of old task training data while preserving model performance. Our proposed strategy achieves state-of-the-art performance on both CIFAR-100 and ImageNet-100 datasets.

Keywords: incremental learning, Convolutional Neural Network, perceptual autoencoder, exemplar selection

1. Introduction

Convolutional neural networks (CNNs) have revolutionized image classification, achieving remarkable accuracy on datasets like ImageNet (Fei-Fei et al. 2010), CIFAR-10 (Krizhevsky 2009), and CIFAR-100 (Krizhevsky 2009). Deep learning has since spread beyond image recognition and, on some benchmarks, has even surpassed human performance. However, these achievements rely on a crucial assumption: all training data is available upfront, and all tasks are known during training. In real-world applications, models often need to learn new tasks after deployment. Unfortunately, current deep learning models are typically task specific. Simply retraining a model on new data when a new task arises is not effective. For example, a CNN trained to classify Toyota car images will not automatically learn to recognize Suzuki cars or even new Toyota models after deployment (Kemker et al. 2018). Incremental learning in deep neural networks addresses this limitation.
It allows a model to continuously learn new knowledge or tasks while retaining previously acquired knowledge. This concept is crucial for deploying models in dynamic environments (Parisi et al. 2019). The significance of incremental learning or lifelong learning, however, is often underestimated. In the realm of deep learning for computer vision, major milestones typically involve non-incremental tasks where all data classes are known in advance (Folly 2017). However, real-world applications of deep convolutional neural networks often require updates for new tasks after the initial training phase.

One practical challenge faced in these applications is catastrophic forgetting. This phenomenon refers to the sudden decline in performance on previously learned tasks when the model is updated with new tasks. The model’s parameters become specifically adapted to the new task, leading to degradation in performance on older tasks. The term ‘catastrophic forgetting’ was introduced by McCloskey and Cohen (1989).

Incremental learning aims to strike a balance between retaining old knowledge and integrating new knowledge from novel tasks. However, these two objectives often conflict with each other. For instance, fine-tuning only the last fully connected layer of a neural network while keeping other layers constant can preserve previous knowledge but may hinder the network’s ability to learn new tasks effectively. Another approach involves replaying all previously seen training data, but this becomes impractical as the number of sequential tasks increases. Real-world constraints, such as storage limitations and data privacy policies, often render previous data inaccessible. Additionally, retraining the model with all old examples for each new task significantly increases computational time.

In our research, we address incremental learning challenges by combining two distinct deep learning approaches.
Our focus is on enhancing the quality and quantity of old task training data while minimizing retraining costs. Specifically, we propose the following strategies:

Image Data Generation: We introduce a novel perceptual autoencoder capable of generating images belonging to specific classes. This approach addresses privacy concerns and provides synthetic data for training.

Exemplar Selection: By selecting a subset of training data, we reduce training time in sequential tasks. Our goal is to minimize performance degradation caused by using a smaller amount of historical training data.

2. Literature review

In the context of deep learning, strategies for addressing catastrophic forgetting can be categorized into three distinct approaches:

1) Replay-based methods aim to replicate the training effect of using all past data (the naive approach) without storing and using all of it. Examples include rehearsal, where a subset of old task data is periodically replayed during training for new tasks, and pseudo-rehearsal, which generates synthetic data resembling old tasks.

2) Regularization methods modify the objective function, the core formula used for training, by adding terms that encourage the model to retain previously learned knowledge. These additional terms act as a form of control, preventing the model from completely forgetting old tasks while learning new ones.

3) Parameter isolation methods focus on the model's internal parameters (weights and biases) and their importance for different tasks. By estimating this importance, the model can selectively update specific parameters depending on the current task. This allows for focused learning without sacrificing past knowledge.

2.1. Replay-based Method for Incremental Learning

As the most notable work in class incremental learning, iCaRL (Rebuffi et al.
2017) stores a subset of exemplar training data by selecting samples whose feature maps are highly similar to the mean feature map of their class. This approach operates under a fixed memory budget, so exemplars of old classes are re-selected by the same criterion to make room for new classes. This work also suggested using a knowledge distillation loss (Hinton et al. 2015) from the previously trained model to preserve performance on old tasks when training a new task. However, Javed and Shafait (2019) provided a detailed analysis of each contribution claimed by iCaRL (Rebuffi et al. 2017) and presented compelling experimental evidence against the effectiveness of these strategies. Firstly, they showed that the exemplar selection method does not perform better than random sampling. Next, a standard CNN architecture with an FC layer was shown to perform at least as well as the iCaRL Nearest-Mean-of-Exemplars Classification. Lastly, knowledge distillation was shown to be the contributing factor in the stellar performance claimed by iCaRL (Onchis and Samuila 2021).

2.2. Pseudo Rehearsal Methods for Incremental Learning

These methods utilize the generative capability of neural networks to approximate previous tasks (Solinas et al. 2023). To the best of our knowledge, Deep Generative Replay (Shin et al. 2017) is the first paper to use a generative model in continual learning. A new Generative Adversarial Net is trained to generate pseudo examples for each incremental task. The authors only reported success on a simple low-resolution dataset (32 x 32 x 3 images of the digits 0–9). More recently, Xiang et al. (2019) used a GAN (Goodfellow et al. 2014) creatively to mitigate the mode collapse problem and minimize the retraining workload. Instead of generating entire images, the system only generates CNN feature maps, which are much smaller in resolution, so that the mode collapse problem can be avoided. The discriminator of the GAN (Goodfellow et al.
2014) is also used for multi-class image classification by attaching another FC layer at the end of the discriminator. However, all components of the system still require retraining when a new task is added. As in (Shin et al. 2017), this work only shows results on a low-resolution dataset (32 x 32 x 3 resolution).

2.3. Regularization Methods for Incremental Learning

These methods add an extra regularization term to the loss function to conserve the knowledge learned on previous tasks when learning a new task. A regularization-based approach is crucial when storing raw inputs is not possible, usually because of privacy or storage concerns. Learning Without Forgetting (Li and Hoiem 2018) uses knowledge distillation to retain the knowledge of preceding tasks. Network outputs (SoftMax or logit) are recorded and used during training on the next task to distil knowledge. Distribution shifts with respect to the previously learned tasks can result in a gradual error build-up on the previous tasks as more differing tasks are added to the model. This error build-up also applies in a class-incremental setup, as shown in (Rebuffi et al. 2017). Elastic Weight Consolidation (Kirkpatrick et al. 2017) applies a Bayesian framework to neural networks: by introducing uncertainty on the network parameters, it finds posterior distributions of parameters instead of mere point estimates in parameter space.

2.4. Parameter Isolation Methods for Incremental Learning

Parameter isolation-based methods suggest dividing the model parameters into different subsets, with each subset used for only one of the tasks (Han et al. 2023; Ma et al. 2023). PackNet (Mallya and Lazebnik 2018) iteratively assigns a subset of the parameters to each of the consecutive tasks by constituting a corresponding binary mask. For each new task, PackNet requires two training phases. First, the network is trained while fixing the parameters assigned to previous tasks.
After the first training phase, a predefined proportion of the remaining non-fixed parameters is allotted to the new task, defined by a binary mask. Selection of the parameters is determined by highest magnitude, which serves as the indicator of parameter importance in that work. In a second training round, this subset of most important parameters is retrained. However, besides fixing all parameters of previous tasks, the remaining unassigned parameters are masked out. Although PackNet allows explicit allocation of network capacity to each task, it remains inherently limited in the number of tasks that can be assigned to a model.

3. Material and Method

3.1. Perceptual Autoencoder for Image Generation

Autoencoders are a type of neural network that excels at dimensionality reduction. They consist of two parts: an encoder and a decoder. The encoder compresses the input data into a lower-dimensional representation using non-linear techniques. This compressed representation, often called a bottleneck feature map, captures the essential features of the input. The decoder then attempts to reconstruct the original input data from this compressed representation, minimizing reconstruction errors. This unsupervised learning process allows autoencoders to learn efficient data compression techniques. This is particularly valuable in lifelong learning, where managing data storage becomes increasingly important as the model encounters new information.

However, autoencoders have a known limitation: reconstructed images often appear blurry. Research shows that this blurring phenomenon is frequently observed in autoencoder outputs, suggesting inherent limitations in the technique's ability to perfectly reconstruct complex data (Gondara 2016; Yang et al. 2018). The blurry reconstructions arise because the loss function typically used prioritizes overall image similarity rather than preserving the fine-grained details essential for tasks like image classification.
These blurry reconstructions lack the necessary details and would not be effective training data.

Figure 1 shows the architecture of the proposed perceptual autoencoder for image generation in class incremental learning. The perceptual autoencoder consists of an autoencoder and a discriminator. The discriminator is trained on a large-scale image dataset for an image classification task. The discriminator CNN is frozen, meaning that none of its parameters are updated while the autoencoder is trained. The input images are processed by the autoencoder, which generates reconstructed images. The input images and reconstructed images are then passed through the frozen discriminator to produce intermediate feature maps. The feature maps generated from the original and synthetic images are then compared using Mean Squared Error (MSE), and the result is added to the loss function of the conventional autoencoder. The discriminator network can be defined as a function $d_\theta(x)$, where $\theta$ represents the parameters of the discriminator. The loss function of the proposed approach now has two terms, one for pixel loss (from the autoencoder) and another for feature loss (to preserve spatial information), as shown in Eq. 1. Here, $x$ represents the input image (input to the encoder) and $x'$ represents the reconstructed image (output of the decoder).

$$L = \|x - x'\|^2 + \|d_\theta(x) - d_\theta(x')\|^2 \tag{1}$$

The training goal of the perceptual autoencoder is to minimize the MSE of both the pixel loss and the visual feature loss. As the discriminator is trained on an image classification task that focuses on locating discriminatory visual features, the only condition for low feature loss to occur is that both input and output images are perceptually similar. The minimization of visual feature loss encourages the autoencoder to produce clear images by conserving spatial information in bottleneck feature maps.
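To make the two terms of Eq. 1 concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: images are flattened to lists of floats, and `toy_discriminator` is a hypothetical stand-in for the frozen CNN $d_\theta$ that simply takes differences between neighbouring pixels. The function names `squared_l2` and `perceptual_loss` are likewise chosen here for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance ||a - b||^2 between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((u - v) ** 2 for u, v in zip(a, b))


def perceptual_loss(
    x: Sequence[float],
    x_rec: Sequence[float],
    d_theta: Callable[[Sequence[float]], Vector],
) -> float:
    """Eq. 1: pixel loss plus feature loss computed through a fixed d_theta."""
    pixel_loss = squared_l2(x, x_rec)
    feature_loss = squared_l2(d_theta(x), d_theta(x_rec))
    return pixel_loss + feature_loss


def toy_discriminator(x: Sequence[float]) -> Vector:
    """Hypothetical stand-in for d_theta (assumption, not a real CNN).

    Each output value depends on a small neighbourhood of the input, which
    loosely mimics a convolutional feature map. It has no trainable state,
    so it is "frozen" by construction.
    """
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


x = [0.0, 1.0, 0.0, 1.0]        # sharp input with strong edges
blurry = [0.5, 0.5, 0.5, 0.5]   # blurred reconstruction: edges are lost
exact = [0.0, 1.0, 0.0, 1.0]    # perfect reconstruction

loss_blurry = perceptual_loss(x, blurry, toy_discriminator)
loss_exact = perceptual_loss(x, exact, toy_discriminator)
```

In this toy case the blurred reconstruction is penalized by both terms, and the feature term is the larger of the two because the edge structure has disappeared. That is the kind of extra signal the feature loss is meant to supply, although a real discriminator would be a pretrained CNN rather than a fixed difference filter.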
We examine feature maps from the discriminator CNN as part of the objective function in reconstructions. Feature maps from various layers within the CNN have varying effects on the mean squared error (MSE) loss objective function. We propose that feature maps from lower levels of the CNN, closer to the input, possess lesser classification capability compared to those from higher levels, deeper within the network. This architectural choice is driven by the principle that deeper neural networks tend to outperform shallower ones due to their increased number of parameters. With more parameters, the CNN can perform more complex mappings. Feature maps are transmitted hierarchically within the CNN, from shallow layers to deeper ones. The feature maps generated by convolution blocks in deeper layers can be interpreted as containing more condensed information, representing the contributions of all model parameters up to that point. By computing MSE on these condensed feature maps, the convergence of the loss value is more likely to signify successful generation of visually similar images. Further details and relevant experiments are presented in section 4.

The latent feature map produced by the encoder is the data prior that needs to be stored. As seen from the perceptual autoencoder architecture, we use a latent size of 8 x 56 x 56 for 224 x 224 x 3 input images (ImageNet (Fei-Fei et al. 2010) dataset) and a latent size of 16 x 16 x 6 for 32 x 32 x 3 input images (CIFAR-100 (Krizhevsky 2009) dataset). The latent feature size dictates the amount of information that can be utilized for reconstructions. In this work, the latent representation of a 224 x 224 x 3 input image holds 6x fewer values than the image has pixel values. This reduction in storage size enables more data to be saved for incremental learning use.
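The compression factors implied by these latent sizes can be checked with a short calculation. The sketch below counts elements only; the helper name `compression_ratio` is introduced here for illustration, and the actual storage saving also depends on how latent values are encoded on disk, which the text does not specify.

```python
from math import prod
from typing import Sequence


def compression_ratio(input_shape: Sequence[int], latent_shape: Sequence[int]) -> float:
    """Ratio of the number of input values to the number of latent values."""
    return prod(input_shape) / prod(latent_shape)


# Shapes quoted in the text above.
imagenet_ratio = compression_ratio((224, 224, 3), (8, 56, 56))  # 150528 / 25088
cifar_ratio = compression_ratio((32, 32, 3), (16, 16, 6))       # 3072 / 1536
```

Under this element-count definition the ImageNet configuration gives a factor of 6 and the CIFAR-100 configuration a factor of 2.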
For low-resolution images in CIFAR-100 (Krizhevsky 2009), this architecture only manages to compress by 2x before the compression starts to degrade classification accuracy. Please note that the discriminator remains static, hence the function $d_\theta(x)$ is constant and maps inputs to outputs consistently. Consequently, all previous derivations of mutual information remain applicable as long as the function $d_\theta(x)$ remains unchanged. As the discriminator is a CNN, input image patterns are captured by convolution kernels. These kernels process information hierarchically, generating feature maps as outputs. Each value within the feature maps is influenced by neighbouring pixels, forming patterns from the input images. The mean squared error (MSE) of these feature maps assumes that each pixel in the input images is dissimilar, thereby conserving the image patterns in the reconstructed images. This additional spatial information significantly contributes to clear image reconstruction. By providing spatial information as a supervised signal to the perceptual autoencoder, the discriminator allows it to accurately reproduce finer details in their correct locations.

3.2. Exemplar Selection

In addition to augmenting the availability of old task data, another objective of incremental learning is to mitigate the retraining cost. As the dataset grows, training the model with all observed examples can lead to a significant escalation in training time. To address this issue, an intuitive approach is to curtail the training examples. However, reducing the training data may result in performance deterioration, as the model is prone to overfitting when the training data fails to adequately represent the entire real data distribution. Exemplar selection methods operate under the assumption that not all training data contribute equally.
These selection methods must be robust enough to minimize the performance decline caused by reduced training data (Rebuffi et al. 2017). In our exemplar selection, we track the forgetting events of each input sample. The forgetting event and the unforgettable examples are defined as follows:

Forgetting event: the prediction output of a model is defined as $y' = \arg\max_i p(y_i \mid x)$, where $x$ is the input sample and $p(y_i \mid x)$ is the confidence score of sample $x$ at neuron $i$ of the SoftMax classification layer. We define a binary output that shows the correctness of the prediction of sample $x$ at every epoch $e$ as $acc_x^e = (y' == \hat{y})$, where $\hat{y}$ denotes the ground-truth label. A forgetting event is considered to have happened when $acc_x^e$ changes from 1 to 0, i.e., $acc_x^e > acc_x^{e+1}$. This transition is referred to as forgetting.

Unforgettable examples: Data samples that have experienced a forgetting event at least once are classified as forgettable. Samples that are never misclassified even once throughout the entire training phase are classified as unforgettable. However, if a sample is never correctly classified during training, then that sample does not qualify as unforgettable.

In convolutional neural networks (CNNs) trained for classification tasks, the convolutional layers generate feature maps. These maps help distinguish different types of data points within a high-dimensional feature space. As the model trains and optimizes its objective function, the decision boundary (the line separating different classes) continually adjusts during backpropagation to reach the optimal classification performance. This dynamic decision boundary can occasionally lead to the misclassification of previously correctly classified samples, especially if they lie near the boundary.
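The definitions above can be sketched as a small bookkeeping routine over per-epoch correctness records. This is an illustrative sketch, not the authors' code: the function names and the `"never_learned"` label are introduced here, and one interpretive assumption is made, namely that a sample with no forgetting events counts as unforgettable as long as it was classified correctly at least once.

```python
from typing import Dict, List, Sequence


def count_forgetting_events(acc: Sequence[int]) -> int:
    """Count epochs e where acc[e] > acc[e + 1], i.e. a 1 -> 0 transition."""
    return sum(1 for prev, curr in zip(acc, acc[1:]) if prev > curr)


def classify_sample(acc: Sequence[int]) -> str:
    """Label a sample from its per-epoch correctness history (1 = correct)."""
    if count_forgetting_events(acc) > 0:
        return "forgettable"
    if any(acc):
        # Assumption: no forgetting event and correct at least once.
        return "unforgettable"
    # Never classified correctly, so it does not qualify as unforgettable.
    return "never_learned"


# Hypothetical correctness histories over four epochs.
histories: Dict[str, List[int]] = {
    "always_correct": [1, 1, 1, 1],
    "learned_late": [0, 1, 1, 1],
    "borderline": [1, 0, 1, 0],
    "never_correct": [0, 0, 0, 0],
}
labels = {name: classify_sample(acc) for name, acc in histories.items()}
```

Under a stricter reading of "never misclassified even once", the `learned_late` history would need its own category; the sketch keeps the looser reading so that every sample falls into one of three groups.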
These samples are somewhat analogous to support vectors in Support Vector Machines (SVMs), as they play a role in shaping the decision boundary. However, the existence of such borderline cases also suggests that some data points might be less critical for the overall model performance. This is because redundant data samples, those that do not significantly influence the decision boundary, can potentially be removed without impacting model accuracy.

While removing data can speed up training in incremental learning, it is crucial to maintain a balanced representation of different classes. Large-scale data removal can lead to imbalanced datasets, which previous research has shown can harm classification accuracy. Imagine the decision boundary in a CNN as a line separating two classes. If one class has significantly fewer samples after data removal, some crucial "support vectors" (data points near the boundary) might be missing. These support vectors are vital for defining the optimal decision boundary. To address this challenge, we propose a balanced removal strategy. This strategy ensures an equal number of examples are removed from each class, regardless of the total amount of data being removed. For example, for a dataset with 10 classes from which 1000 samples must be removed, this strategy removes the first 100 samples from each class, maintaining a balanced representation for training.

4. Experiments and Discussions

4.1. Datasets

For all the experiments, we use the CIFAR-100 (Krizhevsky 2009) and ImageNet (Fei-Fei et al. 2010) datasets. The CIFAR-100 (Krizhevsky 2009) dataset contains 60000 RGB images of size 32 x 32 x 3. There are 100 distinct classes in this dataset, each with 500 training examples and 100 testing examples. ImageNet ILSVRC 2012 (Fei-Fei et al.
2010) is a standard dataset that contains 1.28 million training images and 50000 testing images of varying resolutions. The common standardizing strategy is to resize the images to 224 x 224 x 3. Due to the huge size of the dataset, we use only a subset of classes from ImageNet (Fei-Fei et al. 2010). The subset that contains the first 100 classes is referred to as ImageNet-100, and the subset that contains the first 10 classes is referred to as ImageNet-10.

4.2. Perceptual autoencoder – Image Generation

In this section, we conduct experiments on three architectural design choices (AlexNet, VGG, and ResNet) to examine their impact on the quality of training data generation. This ablation study aims to elucidate the contribution or limitation of each component in the Perceptual Autoencoder. We assess generation quality using two methods: visual inspection and model training accuracy. For high-resolution images, we directly inspect the generated images to gauge generation quality. For low-resolution images, we utilize the generated images to train a standard CNN architecture for image classification tasks and compare the model's performance with that of a model trained using all real examples. The disparity in model accuracy resulting from the use of real and generated images serves as a reflection of generation quality, as the ultimate objective of the perceptual autoencoder is to generate high-quality training data in incremental learning scenarios. Additionally, we conduct experiments in non-incremental learning settings to better discern the individual contributions of each component design in the perceptual autoencoder.

4.2.1. Image Quality and Accuracy

This experiment explores how the choice of convolutional neural network (CNN) architecture for the discriminator affects image generation in a perceptual autoencoder.
We investigate this in two parts:

1) Sensitivity to Spatial Variation: We test how different CNN architectures respond to variations in the spatial information of images. Architectures more sensitive to these variations are expected to produce feature maps containing richer details. These detailed feature maps, in turn, can guide the perceptual autoencoder towards generating sharper images during reconstruction.

2) Effect on Image Quality: We verify the connection between a discriminator's sensitivity and image quality. We hypothesize that using a CNN architecture less sensitive to spatial variations will lead to the perceptual autoencoder generating blurrier images. This is because such a discriminator is less likely to penalize slight reconstruction errors that do not significantly alter the overall image structure.

To test these hypotheses, we compare the performance of three distinct CNN architectures as the discriminator: AlexNet (Krizhevsky et al. 2012), VGG-16 (Simonyan and Zisserman 2015), and ResNet-50 (He et al. 2016). These architectures offer varying levels of sensitivity to spatial variations, allowing us to observe the impact on the perceptual autoencoder's image generation capabilities.

In the spatial variation sensitivity test, we introduce two simultaneous and random steps to the test images: translation and rotation. For translation, we apply a random vertical and horizontal translation of 20% of the total number of pixels. For rotation, a random rotation within the range of 0 to 360 degrees is applied to the test images. Each test image has a 50% chance of undergoing these transformations. To maintain consistency, we utilize pretrained models from PyTorch (Paszke et al. 2019). Since all PyTorch pretrained models are trained on ImageNet, this experiment employs ImageNet test images as the benchmark dataset. We compute accuracies three times and report the average accuracies.
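The perturbation protocol described above can be sketched as a parameter sampler. This is only an illustration under stated assumptions: the function name `sample_perturbation` is hypothetical, the 20% translation is read as an upper bound on the shift along each axis, and applying the sampled angle and offsets to an actual image is left to whatever image library is in use.

```python
import random
from typing import Optional, Tuple


def sample_perturbation(
    rng: random.Random,
    width: int,
    height: int,
    p: float = 0.5,
    max_shift: float = 0.2,
) -> Optional[Tuple[float, int, int]]:
    """Sample (angle_degrees, dx, dy) for one test image, or None if untouched.

    With probability p the image is perturbed. The angle is drawn from
    0 to 360 degrees and each shift is at most max_shift of that dimension.
    """
    if rng.random() >= p:
        return None  # image is left unprocessed
    angle = rng.uniform(0.0, 360.0)
    limit_x = int(max_shift * width)
    limit_y = int(max_shift * height)
    dx = rng.randint(-limit_x, limit_x)
    dy = rng.randint(-limit_y, limit_y)
    return angle, dx, dy


rng = random.Random(0)  # fixed seed so repeated runs draw the same parameters
draws = [sample_perturbation(rng, 224, 224) for _ in range(1000)]
applied = [d for d in draws if d is not None]
```

Sampling the parameters separately from applying them makes it easy to reuse the same perturbations across AlexNet, VGG-16, and ResNet-50, so that accuracy differences reflect the architectures rather than the random draws.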
Additionally, a control group is included, tested using unprocessed test data, and its accuracies serve as reference baselines.

In the second part of the experiment, we employ the three CNN architectures as the discriminator for the perceptual autoencoder. Image generation experiments are conducted using the higher resolution ImageNet dataset for easier visual inspection. The perceptual autoencoder is trained for three epochs using the Adam optimizer with a learning rate of 0.001. Feature maps from the last convolutional block are fed to the mean squared error (MSE) loss function for all models.

Figure 4 supports our hypothesis about spatial variation sensitivity in CNNs. As expected, ResNet-50 exhibits the least performance degradation when rotations and translations are applied to the test images. This suggests that ResNet-50 is less sensitive to these spatial variations compared to other models. Conversely, VGG-16 shows the most significant drop in accuracy, indicating a higher sensitivity to spatial variations. It is also worth noting that VGG-16 generated the sharpest images during the experiment. These observations align with our initial hypothesis in section 3.1: CNN architectures more sensitive to spatial variations tend to capture richer details in their feature maps, potentially leading to sharper image reconstructions in the perceptual autoencoder.

Table 1. Classification accuracies of different CNN models reported on the ImageNet-1K (Fei-Fei et al. 2010) dataset.

| CNN Model | Accuracy (%) |
| --- | --- |
| AlexNet (Krizhevsky et al. 2012) | 62.50 |
| VGG-16 (Simonyan and Zisserman 2015) | 76.30 |
| ResNet-50 (He et al. 2016) | 79.26 |

Despite achieving the highest image classification accuracy among the three architectures, ResNet-50 produces blurry images. The quality of image generation does not align with the reported model prediction accuracy shown in Table 1.
This discrepancy highlights that a model’s accuracy does not necessarily correlate with the image generation quality of a perceptual autoencoder. The underlying reason for this phenomenon lies in the spatial variation sensitivity within the CNN architecture, which directly impacts the spatial information captured in the discriminator feature maps. Architectures like ResNet-50, which are robust in handling spatial variations, do not yield significantly different feature maps when an image is slightly translated or rotated. Consequently, they perform well in image classification tasks. However, this lack of diverse feature maps poses a challenge for autoencoders, as they require precise spatial information for image features. In the absence of supervisory signals related to high-precision spatial information, perceptual autoencoders behave similarly to standard autoencoders, resulting in blurry image generation. VGG-16, despite having lower classification performance, outperforms ResNet-50 in image generation. This superiority can be attributed to VGG-16’s heightened sensitivity to spatial variations, allowing it to provide exact spatial information for small image features. Notably, prior research (Srivastava and Grill-Spector 2018) also reports similar findings, demonstrating that ResNet architectures are less affected by spatial variations compared to VGG-16. Ultimately, the critical factor influencing the perceptual autoencoder lies in the spatial information supplied by the discriminator, rather than the model’s classification accuracy.

4.2.2. Size Compression

In this study, we investigate the effect of compression rate on image generation quality by varying the size of the bottleneck latent feature map. Our approach involves training a perceptual autoencoder using VGG-16 as the discriminator.
The feature maps from the last convolutional block serve as input to the mean squared error (MSE) loss function. To optimize the model, we employ the Adam optimizer with a learning rate of 0.001. The perceptual autoencoder is trained separately for two image datasets: ImageNet and CIFAR-100. We train it for 5 epochs on ImageNet and 50 epochs on CIFAR-100. As a measure of compression, we experiment with different latent feature map sizes. These sizes act as proxies for the level of compression applied to the images. Our experiments involve two distinct image resolutions: CIFAR-100 (32 x 32 x 3) and ImageNet (224 x 224 x 3). The compression rate is quantified in terms of the number of pixels affected by the latent feature map size.

Table 2. Compression rates for different image resolutions and their effect on image classification task training data quality.

| Dataset | Image Resolution | Number of Classes | Compression Rate | Original Image Format | Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| ImageNet-10 (Fei-Fei et al. 2010) | 224 x 224 x 3 | 10 | 12x | JPEG | 75.10 |
| ImageNet-10 (Fei-Fei et al. 2010) | 224 x 224 x 3 | 10 | 6x | JPEG | 78.00 |
| CIFAR-100 (Krizhevsky 2009) | 32 x 32 x 3 | 10 | 3x | PNG | 71.10 |
| CIFAR-100 (Krizhevsky 2009) | 32 x 32 x 3 | 10 | 2x | PNG | 75.00 |

We explore how image compression within the perceptual autoencoder affects the performance of a downstream image classifier (ResNet-18 in this case). We use the standard ImageNet dataset with images of resolution 224 x 224 x 3. The results in Table 2 show a trade-off between compression ratio and classification accuracy. Compressing images by 12x (compared to 6x) leads to a 2.9% drop in classification performance. This is because the bottleneck feature map between the encoder and decoder acts as a compressed representation of the image. Reducing its size limits the amount of information it can retain for reconstruction.
As seen in Table 2, significant performance degradation in the image classifier occurs on CIFAR-100 when the compression ratio reaches 3x (reduction to a third of the original size). This suggests that the perceptual autoencoder can compress CIFAR-100 images by up to 2x without sacrificing significant classification accuracy. The effect of compression on image quality also depends on the dataset's original resolution. In datasets like CIFAR-100 with low-resolution images (32 x 32 x 3), each pixel value carries more weight for classification, so even minor distortions caused by compression can affect accuracy more than they would in higher-resolution datasets. Fortunately, real-world image classification tasks rarely involve such low resolutions. The 6x compression rate achieved on ImageNet offers a more realistic scenario. Compressing a 100 GB dataset by 6x reduces its size to a more manageable 16.7 GB. With some tolerable performance loss, further compression to 8.3 GB (12x) might be possible. This approach not only addresses storage concerns in lifelong learning but also enables efficient dataset transfer over wireless networks.

4.3. Exemplar selection + perceptual autoencoder

4.3.1. Experiment setup for CIFAR-100

In this section, we primarily evaluate our proposed strategy against the approaches discussed by Li and Hoiem (2018), using similar experimental settings with the CIFAR-100 dataset. For these experiments, we train the base model using the first 10 classes of CIFAR-100 and then incrementally add the remaining classes 10 at a time. We employ the ResNet architecture as our classifier to ensure consistency in comparison with Li and Hoiem’s work. Because CIFAR-100 contains 100 classes and we add 10 classes per step, the experiment comprises nine incremental steps, which lets us assess the long-term effectiveness of our proposed incremental learning approach.
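The class schedule just described (10 base classes, then 10 new classes per step) can be written down explicitly. The sketch below is illustrative only: the function name `class_incremental_schedule` is hypothetical, and it assumes that classes are taken in index order, which the text suggests ("the first 10 classes") but does not spell out for later steps.

```python
from typing import List


def class_incremental_schedule(num_classes: int, base: int, step: int) -> List[List[int]]:
    """Split class indices into one base task followed by equally sized increments."""
    if base <= 0 or step <= 0 or base > num_classes:
        raise ValueError("base and step must be positive and base <= num_classes")
    tasks = [list(range(base))]
    for start in range(base, num_classes, step):
        tasks.append(list(range(start, min(start + step, num_classes))))
    return tasks


# CIFAR-100 setting from the text: 10 base classes, then 10 classes per step.
tasks = class_incremental_schedule(num_classes=100, base=10, step=10)
seen = 0
for index, task in enumerate(tasks):
    seen += len(task)
    label = "base" if index == 0 else f"step {index}"
    print(f"{label}: classes {task[0]}-{task[-1]}, {seen} classes seen")
```

With these settings the schedule contains the base task plus nine incremental steps, and accuracy is first recorded once 20 classes have been seen.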
Many methods experience performance degradation when dealing with long sequences of incremental learning tasks. We also report the accuracy of a baseline model trained using all available training data, which serves as an upper bound for all incremental learning methods. We begin recording classification accuracy at 20 classes, as no incremental learning occurs for the initial 10 classes. Our proposed methods are labelled as follows: 1) Perceptual Autoencoder, 2) Perceptual Autoencoder + Exemplar Selection 20%, and 3) Perceptual Autoencoder + Exemplar Selection 40%. The percentages (20% and 40%) in the labels represent the portion of old task examples removed by our balanced removal strategy. During each incremental step, we record the highest accuracy achieved, save the model, and use it to initialize the model for the next incremental step.

Our perceptual autoencoder employs a 2x compression rate to ensure high generation quality. It is trained for 50 epochs using the first 10 base classes, with the Adam optimizer and a learning rate of 0.001. All training data, including both old and new class data, is generated by the same perceptual autoencoder before being used to train the ResNet classifier. For the classifier, we train ResNet-50 for 200 epochs using the SGD optimizer. The learning rate starts at 0.1 and is multiplied by 0.1 at epochs 60, 90, 120, and 160. For exemplar selection, we employ the balanced removal strategy. Each class in the dataset is filtered only once, after training for the respective incremental step has concluded. The number of samples belonging to old classes remains constant once they are filtered.
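The classifier's step schedule described above can be made concrete with a few lines of code. This is a sketch under two assumptions that the text does not state: epochs are counted from 0, and each drop takes effect from the milestone epoch onward. The function name `step_lr` is hypothetical.

```python
from typing import Sequence


def step_lr(epoch: int, base_lr: float = 0.1, gamma: float = 0.1,
            milestones: Sequence[int] = (60, 90, 120, 160)) -> float:
    """Learning rate at a 0-indexed epoch: multiplied by gamma per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Print the rate just before and at each drop over the 200-epoch run.
for epoch in (0, 59, 60, 90, 120, 160, 199):
    print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.0e}")
```

In PyTorch, the same schedule is available as `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[60, 90, 120, 160]` and `gamma=0.1`; whether the authors used this class is not stated in the text.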
Figure 5 confirms the challenge of catastrophic forgetting in incremental learning, where performance drops as new tasks are added. However, our proposed methods show significant promise in mitigating this issue. The Perceptual Autoencoder without exemplar selection stands out. As seen in Table 10 (baseline column, last row), its accuracy is only 7.6% below that of the baseline trained with all real data, making it the closest to the baseline among all methods (Fig. 5, last point of each line). Furthermore, Table 10 shows that our method surpasses the current best pseudo-rehearsal methods, such as incGAN (Li and Hoiem 2018), by 8.1%. This validates our approach of maximizing pseudo-examples through image generation, which essentially applies to incremental learning the well-established deep learning principle of leveraging more training data. Unlike incGAN, which requires retraining a Generative Adversarial Network (GAN) for each new task, our method achieves competitive performance without retraining the image generation component. Even with exemplar selection (20% and 40% balanced removal), as shown in Fig. 5, our methods outperform incGAN, which also utilizes a high number of pseudo-examples. This advantage persists as the model encounters more incremental steps, demonstrating its suitability as a long-term strategy. Our data-oriented approach achieves state-of-the-art performance while minimizing the number of old task samples required. This is possible due to the effectiveness of our high-fidelity pseudo-examples and well-chosen data exemplars.

4.3.2. Experiment setup for ImageNet

In this experiment, we compare our approach with another method applied to the ImageNet dataset, as reported by Wu et al. (2019). Both our method and the one in Wu et al. (2019) are rehearsal-based approaches that store real training examples.
Notably, our proposed method introduces a novel pseudo-rehearsal strategy, which, to the best of our knowledge, is the first of its kind to be evaluated on the ImageNet dataset. Instead of reporting top-1 accuracy, we focus on top-5 accuracy, in line with Wu et al. (2019), who exclusively reported top-5 accuracy for ImageNet. Our proposed methods are labelled as follows: 1) Perceptual Autoencoder, 2) Perceptual Autoencoder + Exemplar Selection 20%, and 3) Perceptual Autoencoder + Exemplar Selection 40%. The percentages (20% and 40%) in the labels represent the portion of old task examples removed by our balanced removal strategy. Additionally, we include a baseline accuracy obtained by training the model using all available training data, which serves as an upper bound for performance comparison.

In our experiment, we use the first 100 classes of the ImageNet dataset as our benchmarking dataset. Each incremental step adds 10 classes, resulting in a gradual expansion of the training set. As in our previous experiments, we train the perceptual autoencoder using the initial 10 classes. The autoencoder is trained only once for the entire experiment. All training data, including both old and new class data, is generated by the same perceptual autoencoder before being used to train the ResNet classifier. The learning settings for the perceptual autoencoder in this experiment are: optimizer, Adam; epochs, 3; compression rate, 6x; learning rate, 0.001; feature maps for the mean squared error (MSE) loss, feature map 5. For the classifier, we employ ResNet-18 as the baseline model. All other classifier training settings mirror those described in Wu et al. (2019). For exemplar selection, we use the balanced removal strategy. Figure 6 confirms the effectiveness of our proposed method.
It achieves the highest accuracy among all compared methods in this lifelong learning scenario on the ImageNet dataset. The core strength of our approach lies in the high quality of the image reconstructions generated by the perceptual autoencoder. Our method without exemplar selection achieves an accuracy of 90.1%, only 5.1% lower than the baseline trained with all real data. This demonstrates the effectiveness of generated data when it faithfully preserves the discriminative information crucial for classification. With a 20% removal rate using balanced removal, our method maintains superior performance compared to existing methods like BiC (Wu et al. 2019). While BiC outperforms our method with 40% removal, it is important to note that BiC is a rehearsal method that relies on storing real training data, which can be a storage bottleneck. Our combination of perceptual autoencoder and balanced removal consistently outperforms iCaRL across all settings. This highlights the advantage of our data-oriented approach in lifelong learning.

The well-established principle in deep learning that "more data is better" applies even to generated data. Our method demonstrates that the perceptual autoencoder can create informative data beneficial for classification tasks. The balanced removal strategy effectively maintains model performance even with a high removal rate (40%). The gap between the baseline and the 40% removal setting is only 11.3%. This translates to a potential 40% reduction in retraining time at the cost of roughly 11% in accuracy, which makes our method a practical choice for scenarios where high accuracy is less critical. The early saturation of performance degradation in our method suggests its suitability for large-scale, long-term incremental learning: its performance remains stable even with a growing number of tasks.
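The 20% and 40% balanced removal settings compared above can be illustrated with a small sketch. The ranking criterion that decides which examples are removed is defined in the methods section and is not reproduced here, so the code below makes two loudly stated assumptions: each example carries a hypothetical numeric score where higher means "more worth keeping", and "balanced" means that the same fraction is removed from every class. The function name `balanced_removal` and the toy data are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, int, float]  # (example id, class label, hypothetical score)


def balanced_removal(examples: List[Example], removal_rate: float) -> List[Example]:
    """Drop the same fraction of lowest-scoring examples from every class.

    Assumption: a higher score marks an example as more worth keeping.
    The number removed per class is rounded down.
    """
    if not 0.0 <= removal_rate < 1.0:
        raise ValueError("removal_rate must be in [0, 1)")
    by_class: Dict[int, List[Example]] = defaultdict(list)
    for example in examples:
        by_class[example[1]].append(example)

    kept: List[Example] = []
    for label in sorted(by_class):
        ranked = sorted(by_class[label], key=lambda e: e[2], reverse=True)
        n_remove = int(len(ranked) * removal_rate)
        kept.extend(ranked[:len(ranked) - n_remove])
    return kept


# Toy data: 2 classes with 5 examples each and made-up scores 0.0 to 4.0.
data = [(f"c{label}_{i}", label, float(i)) for label in (0, 1) for i in range(5)]
kept = balanced_removal(data, removal_rate=0.4)
per_class = {label: sum(1 for e in kept if e[1] == label) for label in (0, 1)}
```

Because each class is filtered once and its retained set then stays fixed, a routine like this would be applied to a class only after the incremental step that introduced it has finished training.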
Our method, categorized as a pseudo-rehearsal method, achieves results competitive with rehearsal methods that have access to original task data. This is possible because the perceptual autoencoder generates high-fidelity images (resolution: 224 x 224 x 3) that effectively capture the information necessary for classification.

5. Conclusion

This work introduces two novel methods to address challenges in class-incremental learning: image generation and exemplar selection.

1) Perceptual Autoencoder for Efficient Data Representation: We propose a new perceptual autoencoder that efficiently reconstructs images from a significantly reduced (6x smaller) feature map. This compressed representation minimizes distortion and avoids the need for hyperparameter tuning across different datasets. The small size of the latent feature map also helps alleviate storage concerns in incremental learning scenarios.

2) Generalizable Image Generation: Our approach enables the generation model to work effectively on various datasets and image resolutions using the same discriminator network. This demonstrates the generalizability of our strategy across different training data distributions. Furthermore, the method requires minimal hyperparameter tuning and avoids retraining, making it suitable for the non-stationary training data encountered in incremental learning.

We also introduce an exemplar selection method that builds upon and improves existing work on mitigating example forgetting. This method is not only applicable to incremental learning but also improves on the original approach through a balanced removal strategy. Our proposed methods achieve state-of-the-art performance on both the CIFAR-100 and ImageNet-100 datasets for image classification tasks with incremental learning. By comparing our results with recent advancements in the field, we ensure the competitiveness of our strategy against existing methods.
To the best of our knowledge, our work presents the first pseudo-rehearsal method that reports incremental learning performance on the ImageNet-100 dataset. This benchmark result can serve as a valuable baseline for future research on pseudo-rehearsal approaches in incremental learning. Our methods focus on improving the training data itself, rather than introducing entirely new CNN models specifically designed for incremental learning. This approach has the advantage of being applicable to any CNN architecture used for image classification, as it does not require modifications to the classification model itself. We argue that this data-oriented approach tackles a more general problem than methods that alter CNN architectures. Additionally, future advancements in image classification architectures would not render our approach obsolete.

Declarations

Ethical Approval
Not Applicable

Competing interests
The authors declare that the paper is free from known competing financial interests or personal relationships that could affect the work reported.

Authors' contributions
Tay Gee Yang conducted the experiments. Swaraj Dube wrote the draft. Hermawan Nugroho supervised and reviewed the article.

Funding
The project is supported by the Department of Electrical and Electronic Engineering, University of Nottingham Malaysia.

Availability of data and materials
The datasets used in the study are available to readers. The datasets are deposited in publicly available repositories.

References

Fei-Fei L, Deng J, Li K (2010) ImageNet: Constructing a large-scale image database. J Vis 9:1037–1037. https://doi.org/10.1167/9.8.1037
Folly KA (2017) Diversity increasing methods in PBIL-application to power system controller design: a comparison. Nat Comput 16. https://doi.org/10.1007/s11047-016-9544-7
Gondara L (2016) Medical Image Denoising Using Convolutional Denoising Autoencoders. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW).
IEEE, pp 241–246
Goodfellow I, Pouget-Abadie J, Mirza M et al (2014) Generative Adversarial Nets. Neural Information Processing Systems, NIPS 2014. https://doi.org/10.1109/ICCVW.2019.00369
Han J, Liu Z, Li Y, Zhang T (2023) SCMP-IL: an incremental learning method with super constraints on model parameters. Int J Mach Learn Cybernet 14. https://doi.org/10.1007/s13042-022-01725-1
He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp 770–778
Hinton G, Vinyals O, Dean J (2015) Distilling the Knowledge in a Neural Network. 1–9
Javed K, Shafait F (2019) Revisiting Distillation and Incremental Classifier Learning. pp 3–17
Kemker R, McClure M, Abitino A et al (2018) Measuring catastrophic forgetting in neural networks. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 3390–3398
Kirkpatrick J, Pascanu R, Rabinowitz N et al (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci U S A 114:3521–3526. https://doi.org/10.1073/pnas.1611835114
Krizhevsky A (2009) Learning multiple layers of features from tiny images
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Neural Inform Process Syst 1106–1114
Li Z, Hoiem D (2018) Learning without Forgetting. IEEE Trans Pattern Anal Mach Intell 40:2935–2947. https://doi.org/10.1109/TPAMI.2017.2773081
Ma R, Wu Q, Ngan KN et al (2023) Forgetting to Remember: A Scalable Incremental Learning Framework for Cross-Task Blind Image Quality Assessment. IEEE Trans Multimedia 25. https://doi.org/10.1109/TMM.2023.3242143
Mallya A, Lazebnik S (2018) PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 7765–7773.
https://doi.org/10.1109/CVPR.2018.00810
McCloskey M, Cohen NJ (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation - Adv Res Theory 24:109–165. https://doi.org/10.1016/S0079-7421(08)60536-8
Onchis DM, Samuila IV (2021) Double distillation for class incremental learning. In: Proceedings – 2021 23rd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2021
Parisi GI, Kemker R, Part JL et al (2019) Continual lifelong learning with neural networks: A review. Neural Netw 113:54–71. https://doi.org/10.1016/j.neunet.2019.01.012
Paszke A, Gross S, Massa F et al (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library
Rebuffi SA, Kolesnikov A, Sperl G, Lampert CH (2017) iCaRL: Incremental classifier and representation learning. Proceedings – 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-Janua:5533–5542. https://doi.org/10.1109/CVPR.2017.587
Shin H, Lee JK, Kim J, Kim J (2017) Continual learning with deep generative replay. Adv Neural Inf Process Syst 2017–Decem:2991–3000
Simonyan K, Zisserman A (2015) Very Deep Convolutional Networks For Large-Scale Image Recognition. International Conference on Learning Representations
Solinas M, Reyboz M, Rousset S et al (2023) On the Beneficial Effects of Reinjections for Continual Learning. SN Comput Sci 4. https://doi.org/10.1007/s42979-022-01392-7
Srivastava M, Grill-Spector K (2018) The Effect of Learning Strategy versus Inherent Architecture Properties on the Ability of Convolutional Neural Networks to Develop Transformation Invariance. ArXiv
Wu Y, Chen Y, Wang L et al (2019) Large Scale Incremental Learning. ArXiv 374–382
Xiang Y, Fu Y, Ji P, Huang H (2019) Incremental learning using conditional adversarial networks. Proceedings of the IEEE International Conference on Computer Vision 2019-Octob:6618–6627.
https://doi.org/10.1109/ICCV.2019.00672
Yang Y, Wu QMJ, Wang Y (2018) Autoencoder With Invertible Functions for Dimension Reduction and Image Reconstruction. IEEE Trans Syst Man Cybern Syst 48:1065–1079. https://doi.org/10.1109/TSMC.2016.2637279

Additional Declarations
No competing interests reported.
{"props":{"pageProps":{"initialData":{"identity":"rs-4146505","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":282743993,"identity":"76a93fa9-4a73-4151-9501-0aae148cf0c3","order_by":0,"name":"Hermawan
Nugroho","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABTElEQVRIie3Rv0rDQBzA8V8IXJeWOkbUvoFwJXBShPZVegTi0ooglICDESFTimseI1KIuEUOmsGjXa+kYEXsooW4KVgxsdS22oJugvlO94cPd9wBpKX9wfCXuQ/56fpkC/2ArJsgmf6vCPYTAqvJTqZ5775Y5X3YsNUHw+gX1PA0uIsO+uWLzHUVogaDbac6T0p2QHpNSzuETU5KnA9V0m/T+GJD7dKuu5LTYUDEAsFCRyLnydRUaqR4YjHqiVoxJkzDft2Vc9Z3cjNEvbF3PCMtZ0q6I1ceLyECoTDnsYSotwlxlQkpYxGfIi0hXEfh1ltALUVvSCZnqiJ06nDMqliM3Cu7s5clfLBAgjbqPfIjeqZorSfTYIW8o/mR8coquFs/Hzw3dgskWDjls/j9kTK/QM2PbwLIwtpyEidH87PKbJj3V5G0tLS0f9E7BOyQ1p540qAAAAAASUVORK5CYII=","orcid":"","institution":"University of Nottingham Malaysia Campus","correspondingAuthor":true,"prefix":"","firstName":"Hermawan","middleName":"","lastName":"Nugroho","suffix":""},{"id":282743994,"identity":"7d486de3-498e-47e5-82b8-af3ef0203c23","order_by":1,"name":"Gee Yang Tay","email":"","orcid":"","institution":"University of Nottingham Malaysia Campus","correspondingAuthor":false,"prefix":"","firstName":"Gee","middleName":"Yang","lastName":"Tay","suffix":""},{"id":282743995,"identity":"21a53ed0-3802-451f-8d8a-be514a2d9710","order_by":2,"name":"Swaraj Dube","email":"","orcid":"","institution":"University of Nottingham Malaysia Campus","correspondingAuthor":false,"prefix":"","firstName":"Swaraj","middleName":"","lastName":"Dube","suffix":""}],"badges":[],"createdAt":"2024-03-22 02:29:25","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4146505/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4146505/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":53670336,"identity":"35c6e09d-ee9c-42df-a7b3-5fb3e8da3903","added_by":"auto","created_at":"2024-03-28 17:46:24","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":148575,"visible":true,"origin":"","legend":"\u003cp\u003eArchitecture of the Perceptual Autoencoder 
Model.\u003c/p\u003e","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-4146505/v1/da5850939d43c32de09169f4.png"},{"id":53670333,"identity":"eee99bf3-ce4d-40da-a20b-075648b800ab","added_by":"auto","created_at":"2024-03-28 17:46:24","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":191819,"visible":true,"origin":"","legend":"\u003cp\u003eMain building blocks of encoder and decoder of the perceptual autoencoder.\u003c/p\u003e","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4146505/v1/0d8e9412b0a063a7046a9bd7.png"},{"id":53671090,"identity":"04761c50-ef7c-415d-8d1f-06722fc6bf3c","added_by":"auto","created_at":"2024-03-28 17:54:25","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":15563,"visible":true,"origin":"","legend":"\u003cp\u003e(a) Details of the convolutional block shown in \u003cstrong\u003eFig. 2\u003c/strong\u003e, (b) details of the up-sampling block shown in \u003cstrong\u003eFig. 
2\u003c/strong\u003e.\u003c/p\u003e","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-4146505/v1/353a94b4093e6ccad704e7b4.png"},{"id":53670331,"identity":"2314233d-e6e9-4e74-84b4-757bf0cb98cc","added_by":"auto","created_at":"2024-03-28 17:46:23","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":59772,"visible":true,"origin":"","legend":"\u003cp\u003eExperiment results for spatial variation sensitivity: effect of spatial variation of test data on different CNN model architectures.\u003c/p\u003e","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-4146505/v1/0f32b62301701af18dff3754.png"},{"id":53670338,"identity":"fa854683-89a5-4412-97e5-c4f886509285","added_by":"auto","created_at":"2024-03-28 17:46:24","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":74587,"visible":true,"origin":"","legend":"\u003cp\u003eTop-5 incremental learning accuracies on CIFAR-100 (Krizhevsky 2009).\u003c/p\u003e","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-4146505/v1/7b9db55b72cd0d8719abbf46.png"},{"id":53670339,"identity":"76d2ec7f-89d6-4c53-8a62-31006f159460","added_by":"auto","created_at":"2024-03-28 17:46:25","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":74368,"visible":true,"origin":"","legend":"\u003cp\u003eTop-5 incremental learning accuracies on ImageNet-100 (Fei-Fei et al. 
2010).\u003c/p\u003e","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-4146505/v1/5466728d2b98d525103426ba.png"},{"id":53852256,"identity":"25c23cf1-f4cf-4e4b-a190-f5f1b5348eb3","added_by":"auto","created_at":"2024-04-01 10:20:15","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1072809,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4146505/v1/ba76d80e-11f3-475e-a224-ec5ad9f1f6c8.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Perceptual Autoencoder and Exemplar Selection for Lifelong Learning in Convolutional Neural Networks (CNNs)","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eConvolutional neural networks (CNNs) have revolutionized image classification, achieving remarkable accuracy on datasets like ImageNet (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e), CIFAR-10 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e), and CIFAR-100 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e). Deep learning has gone beyond image recognition, even surpassing human performance. However, these achievements rely on a crucial assumption: all training data is available upfront, and all tasks are known during training. In real-world applications, models often need to learn new tasks after deployment. Unfortunately, current deep learning models are typically task specific. Simply retraining a model on new data when a new task arises isn't effective. For example, a CNN trained to classify Toyota car images will not automatically learn to recognize Suzuki cars or even new Toyota models after deployment (Kemker et al. 
\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2018\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIncremental learning in deep neural networks addresses this limitation. It allows a model to continuously learn new knowledge or tasks while retaining previously acquired knowledge. This concept is crucial for deploying models in dynamic environments (Parisi et al. \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). The significance of incremental learning or lifelong learning, however, is often underestimated. In the realm of deep learning for computer vision, major milestones typically involve non-incremental tasks where all data classes are known in advance (Folly \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). However, real-world applications of deep convolutional neural networks often require updates for new tasks after the initial training phase. One practical challenge faced in these applications is catastrophic forgetting. This phenomenon refers to the sudden decline in performance on previously learned tasks when the model is updated with new tasks. The model\u0026rsquo;s parameters become specifically adapted to the new task, leading to degradation in performance on older tasks. The term \u0026lsquo;catastrophic forgetting\u0026rsquo; was introduced by McCloskey and Cohen (McCloskey and Cohen \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e1989\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIncremental learning aims to strike a balance between retaining old knowledge and integrating new knowledge from novel tasks. However, these two objectives often conflict with each other. For instance, fine-tuning only the last fully connected layer of a neural network while keeping other layers constant can preserve previous knowledge but may hinder the network\u0026rsquo;s ability to learn new tasks effectively. 
Another approach involves replaying all previously seen training data, but this becomes impractical as the number of sequential tasks increases. Real-world constraints, such as storage limitations and data privacy policies, often render previous data inaccessible. Additionally, retraining the model with all old examples for each new task significantly increases computational time.\u003c/p\u003e \u003cp\u003eIn our research, we address incremental learning challenges by combining two distinct deep learning approaches. Our focus is on enhancing the quality and quantity of old task training data while minimizing retraining costs. Specifically, we propose the following strategies:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eImage Data Generation: We introduce a novel perceptual autoencoder capable of generating images belonging to specific classes. This approach addresses privacy concerns and provides synthetic data for training.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eExemplar Selection: By selecting a subset of training data, we reduce training time in sequential tasks. Our goal is to minimize performance degradation caused by using a smaller amount of historical training data.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e"},{"header":"2. Literature review","content":"\u003cp\u003eIn the context of deep learning, addressing the issue of catastrophic forgetting involves employing various strategies. It can be categorized into three distinct approaches; 1) Replay-based methods aim to replicate the training effect of using all past data (the naive approach) without storing and using all of it. Examples include rehearsal, where a subset of old task data is periodically replayed during training for new tasks, and pseudo-rehearsal, which generates synthetic data resembling old tasks. 
2) Regularization methods modify the objective function, the core formula used for training, by adding terms that encourage the model to retain previously learned knowledge. These additional terms act as a form of control, preventing the model from completely forgetting old tasks while learning new ones. 3) Parameter isolation methods focus on the model's internal parameters (weights and biases) and their importance for different tasks. By estimating this importance, the model can selectively update specific parameters depending on the current task. This allows for focused learning without sacrificing past knowledge.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1. Reply-based Method for Incremental Learning\u003c/h2\u003e \u003cp\u003eAs the most notable work in class incremental learning, iCaRL (Rebuffi et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) stores a subset of exemplar training data copies by selecting data with feature map that has high similarity to the mean feature map of each class. This approach is restricted to a set memory budget, meaning that old classes are re-selected according to the same criteria to fit new classes. This work also suggested the use of knowledge distillation loss (Hinton et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2015\u003c/span\u003e) from previous trained model to preserve the performance of old task when training new task. However, in (Javed and Shafait \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) the author provided a detailed analysis for each contribution claimed by iCaRL (Rebuffi et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) and provided compelling experimental evidence to disprove the effectiveness of the strategies. 
Firstly, (Javed and Shafait \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) showed that the exemplar selecting method does not perform better than random sampling. Next, a standard CNN architecture with FC layer is shown to be able to perform at least as good as the iCaRL Nearest-Mean-of-Exemplars Classification. Lastly, knowledge distillation is shown to be the contributing factor in the stellar claimed by iCaRl(Onchis and Samuila \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2021\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2. Pseudo Rehearsal Methods for Incremental Learning\u003c/h2\u003e \u003cp\u003eThese methods utilize the generative capability of neural network to approximate previous task(Solinas et al. \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Deep Generative Replay(Shin et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) is the first paper to use generative model in continual learning to the best of our knowledge. A new Generative Adversarial Net is trained to generate pseudo examples for each incremental task. The author only reported success on low resolution simple dataset (32 x 32 x 3 resolution 0\u0026ndash;9-digit images). Recently,(Xiang et al. \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) uses a GAN(Goodfellow et al. \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2014\u003c/span\u003e) creatively to mitigate the model collapsing problem and minimize retraining workload. Instead of generating entire images, the system only generates CNN feature maps that is much smaller in resolution and model collapse problem can be avoided. The discriminator of GAN(Goodfellow et al. 
\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2014\u003c/span\u003e) is also used for multi-class image classification by attaching another FC layer at the end of the discriminator. However, all components of the system still require retraining when a new task is added. As in (Shin et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2017\u003c/span\u003e), this work only shows results on a low-resolution dataset (32 x 32 x 3 resolution).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3. Regularization Methods for Incremental Learning\u003c/h2\u003e \u003cp\u003eThese methods add an extra regularization term to the loss function to conserve the knowledge learned on previous tasks when learning a new task. Regularization-based approaches are crucial when storing raw inputs is not possible, usually because of privacy or storage constraints. Learning Without Forgetting (Li and Hoiem \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) uses knowledge distillation to retain preceding tasks\u0026rsquo; knowledge. Network outputs (SoftMax or logit) are recorded and used during training on the next task to distil knowledge. Distribution shifts with respect to the previously learned tasks can result in a gradual error build-up on the previous tasks as more dissimilar tasks are added to the model. This error build-up also applies in a class-incremental setup, as shown in (Rebuffi et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). Elastic Weight Consolidation (Kirkpatrick et al. 
\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) applies a Bayesian framework to neural networks: by introducing uncertainty on the network parameters, it allows posterior distributions of the parameters to be found instead of mere point estimates in parameter space.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4. Parameter Isolation Methods for Incremental Learning\u003c/h2\u003e \u003cp\u003eParameter isolation-based methods divide the model parameters into different subsets, with each subset used for only one of the tasks (Han et al. \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Ma et al. \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). PackNet (Mallya and Lazebnik \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) iteratively assigns a subset of the parameters to each consecutive task by constructing a corresponding binary mask. For each new task, PackNet requires two training phases. First, the network is trained while the parameters assigned to previous tasks are kept fixed. After this first phase, a predefined proportion of the remaining non-fixed parameters is allotted to the new task, as defined by a binary mask. Parameters are selected by highest magnitude, which serves as the indicator of parameter importance in that work. In the second training phase, this subset of most important parameters is retrained. During this phase, besides fixing all parameters of previous tasks, the remaining unassigned parameters are masked out. Although PackNet allows explicit allocation of network capacity to each task, it remains inherently limited in the number of tasks that can be assigned to a model.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Material and Method","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e3.1. 
Perceptual Autoencoder for Image Generation\u003c/h2\u003e \u003cp\u003eAutoencoders are a type of neural network that excels at dimensionality reduction. They consist of two parts: an encoder and a decoder. The encoder compresses the input data into a lower-dimensional representation using non-linear transformations. This compressed representation, often called a bottleneck feature map, captures the essential features of the input. The decoder then attempts to reconstruct the original input data from this compressed representation, minimizing the reconstruction error. This unsupervised learning process allows autoencoders to learn efficient data compression. This is particularly valuable in lifelong learning, where managing data storage becomes increasingly important as the model encounters new information. However, autoencoders have a known limitation: reconstructed images often appear blurry. This blurring is frequently observed in autoencoder outputs, suggesting inherent limitations in the technique's ability to perfectly reconstruct complex data (Gondara \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Yang et al. \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). The blur arises because the loss function typically used prioritizes overall image similarity rather than preserving the fine-grained details essential for tasks like image classification. Such blurry reconstructions lack the necessary details and would not be effective training data.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows the architecture of the proposed perceptual autoencoder for image generation in class-incremental learning. The perceptual autoencoder consists of an autoencoder and a discriminator. The discriminator is trained for image classification on a large-scale image dataset. 
The discriminator CNN is frozen, meaning that none of its parameters are updated while the autoencoder is trained. The input images are processed by the autoencoder, which generates reconstructed images. The input and reconstructed images are then passed through the frozen discriminator to produce intermediate feature maps. The feature maps of the original and reconstructed images are compared using Mean Squared Error (MSE), and this term is added to the loss function of the conventional autoencoder. The discriminator network can be defined as a function \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${d}_{\\theta }\\left(x\\right)\$\u003c/span\u003e\u003c/span\u003e where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\theta\$\u003c/span\u003e\u003c/span\u003e represents the parameters of the discriminator. The loss function of the proposed approach therefore has two terms, one for pixel loss (from the autoencoder) and another for feature loss (to preserve spatial information), as shown in Eq.\u0026nbsp;\u003cspan refid=\"Equ1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. 
Here, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$x\$\u003c/span\u003e\u003c/span\u003e represents the input image (input to encoder) and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${x}^{{\\prime }}\$\u003c/span\u003e\u003c/span\u003e represents the reconstructed image (output of decoder).\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$L= {‖x-{x}^{{\\prime }}‖}^{2}+{‖{d}_{\\theta }\\left(x\\right)-{d}_{\\theta }\\left({x}^{{\\prime }}\\right)‖}^{2}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe training goal of perceptual autoencoder is to minimize the MSE of both pixel loss and visual feature loss. As the discriminator is trained on image classification task that focuses on locating discriminatory visual features, the only condition for low feature loss to occur is that both input and output images are perceptually similar. The minimization of visual feature loss encourages the autoencoder to produce clear images by conserving spatial information in bottleneck feature maps.\u003c/p\u003e \u003cp\u003eWe examine feature maps from the discriminator CNN as part of the objective function in reconstructions. Feature maps from various layers within the CNN have varying effects on the mean squared error (MSE) loss objective function. We propose that feature maps from lower levels of the CNN, closer to the input, possess lesser classification capability compared to those from higher levels, deeper within the network. This architectural choice is driven by the principle that deeper neural networks tend to outperform shallower ones due to their increased number of parameters. With more parameters, the CNN can perform more complex mappings. Feature maps are transmitted hierarchically within the CNN, from shallow layers to deeper ones. 
The feature maps generated by convolution blocks in deeper layers can be interpreted as containing more condensed information, representing the contributions of all model parameters up to that point. By computing MSE on these condensed feature maps, the convergence of the loss value is more likely to signify successful generation of visually similar images. Further details and relevant experiments are presented in section 4.\u003c/p\u003e \u003cp\u003eThe latent feature map produced by the encoder is the data prior that needs to be stored. As seen from the perceptual autoencoder architecture, we use a latent size of 8 x 56 x 56 for 224 x 224 x 3 input images (ImageNet (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e) dataset) and a latent size of 16 x 16 x 6 for 32 x 32 x 3 input images (CIFAR-100 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e) dataset). The latent feature size dictates the amount of information that can be used for reconstruction. In this work, the latent representation of a 224 x 224 x 3 input image is 6x smaller than the input in terms of the number of values (25,088 latent values versus 150,528 input values). This reduction in storage size enables more data to be saved for incremental learning. For the low-resolution images in CIFAR-100 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e), this architecture only manages 2x compression (1,536 latent values versus 3,072 input values) before the compression starts to degrade classification accuracy.\u003c/p\u003e \u003cp\u003ePlease note that the discriminator remains static, hence the function \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${d}_{\\theta }\\left(x\\right)\$\u003c/span\u003e\u003c/span\u003e is fixed and maps inputs to outputs consistently. 
Consequently, all previous derivations of mutual information remain applicable as long as the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${d}_{\\theta }\\left(x\\right)\$\u003c/span\u003e\u003c/span\u003e function remains unchanged. As the discriminator is a CNN, input image patterns are captured by convolution kernels. These kernels process information hierarchically, generating feature maps as outputs. Each value within the feature maps is influenced by neighbouring pixels, forming patterns from the input images. The mean squared error (MSE) of these feature maps assumes that each pixel in the input images is dissimilar, thereby conserving the image patterns in the reconstructed images. This additional spatial information significantly contributes to clear image reconstruction. By providing spatial information as a supervised signal to the perceptual autoencoder, it can accurately reproduce finer details in their correct locations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e3.2. Exemplar Selection\u003c/h2\u003e \u003cp\u003eIn addition to augmenting the availability of old task data, another objective of incremental learning is to mitigate the retraining cost. With the dataset size increasing, training the model with all observed examples can lead to a significant escalation in training time. To address this issue, an intuitive approach is to curtail the training examples. However, reducing the training data may result in performance deterioration, as the model is prone to overfitting when the training data fails to adequately represent the entire real data distribution. Exemplar selection methods operate under the assumption that not all training data contribute equally. These selection methods must be robust enough to minimize the performance decline caused by reduced training data (Rebuffi et al. 
\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2017\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn our exemplar selection, we track the forgetting events of each input sample based on the model's predictions. The forgetting event and the unforgettable examples are defined as follows:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cem\u003eForgetting event\u003c/em\u003e: the prediction output of a model is defined as: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${y}^{{\\prime }}=\\text{arg}\\underset{\\text{i}}{\\text{max}}p\\left({y}_{i}|x\\right)\$\u003c/span\u003e\u003c/span\u003e where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$x\$\u003c/span\u003e\u003c/span\u003e is the input sample and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${y}_{i}\$\u003c/span\u003e\u003c/span\u003e is the confidence score of sample \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$x\$\u003c/span\u003e\u003c/span\u003e at neuron \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$i\$\u003c/span\u003e\u003c/span\u003e of the SoftMax classification layer. 
We define a binary output that indicates whether the prediction for sample \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$x\$\u003c/span\u003e\u003c/span\u003e is correct at every epoch \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$e\$\u003c/span\u003e\u003c/span\u003e as follows: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${acc}_{x}^{e}=\\left({y}^{{\\prime }}==\\widehat{y}\\right)\$\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\widehat{y}\$\u003c/span\u003e\u003c/span\u003e denotes the ground-truth label of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$x\$\u003c/span\u003e\u003c/span\u003e. 
A forgetting event occurs when \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${acc}_{x}^{e}\$\u003c/span\u003e\u003c/span\u003e changes from 1 to 0 between consecutive epochs, i.e., \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${acc}_{x}^{e}\u0026gt;{acc}_{x}^{e+1}\$\u003c/span\u003e\u003c/span\u003e. This transition is referred to as forgetting.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cem\u003eUnforgettable examples\u003c/em\u003e: Data samples that have experienced a forgetting event at least once are classified as forgettable. Samples that, once correctly classified, are never misclassified again for the rest of the training phase are classified as unforgettable. However, if a sample is never correctly classified during training, then that sample does not qualify as unforgettable.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eIn convolutional neural networks (CNNs) trained for classification tasks, the convolutional layers generate feature maps. These maps help distinguish different types of data points within a high-dimensional feature space. As the model trains and optimizes its objective function, the decision boundary \u0026ndash; the surface separating different classes \u0026ndash; continually adjusts during backpropagation to reach the optimal classification performance. This shifting decision boundary can occasionally cause previously correctly classified samples to be misclassified, especially if they lie near the boundary. These samples are somewhat analogous to support vectors in Support Vector Machines (SVMs), as they play a role in shaping the decision boundary. However, the existence of such borderline cases also suggests that the remaining data points, which lie far from the boundary, might be less critical for the overall model performance. 
This is because redundant data samples, those that do not significantly influence the decision boundary, can potentially be removed without impacting model accuracy.\u003c/p\u003e \u003cp\u003eWhile removing data can speed up training in incremental learning, it is crucial to maintain a balanced representation of different classes. Large-scale data removal can lead to imbalanced datasets, which previous research has shown can harm classification accuracy. Imagine the decision boundary in a CNN as a line separating two classes. If one class has significantly fewer samples after data removal, some crucial \"support vectors\" (data points near the boundary) might be missing. These support vectors are vital for defining the optimal decision boundary.\u003c/p\u003e \u003cp\u003eTo address this challenge, we propose a balanced removal strategy. This strategy ensures that an equal number of examples is removed from each class, regardless of the total amount of data being removed. For example, if a dataset has 10 classes and 1000 samples must be removed, this strategy removes the first 100 samples from each class, maintaining a balanced representation for training.\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Experiments and Discussions","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.1. Datasets\u003c/h2\u003e \u003cp\u003eFor all the experiments, we use the CIFAR-100 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e) and ImageNet (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e) datasets. The CIFAR-100 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e) dataset contains 60000 RGB images of size 32 x 32 x 3. 
The dataset has 100 distinct classes, each with 500 training examples and 100 testing examples. ImageNet ILSVRC 2012 (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e) is a standard dataset that contains 1.28\u0026nbsp;million training images and 50000 testing images of varying resolutions. The common standardizing strategy is to resize the images to 224 x 224 x 3. Due to the huge size of the dataset, we use only a subset of classes from the ImageNet (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e) data. The subset that contains the first 100 classes is referred to as ImageNet-100 (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e) and the subset that contains the first 10 classes is referred to as ImageNet-10 (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.2. Perceptual autoencoder \u0026ndash; Image Generation\u003c/h2\u003e \u003cp\u003eIn this section, we conduct experiments on three discriminator architectures (AlexNet, VGG, and ResNet) to examine their impact on the quality of training data generation. This ablation study aims to elucidate the contribution or limitation of each component in the perceptual autoencoder. We assess generation quality using two methods: visual inspection and model training accuracy. For high-resolution images, we directly inspect the generated images to gauge generation quality. For low-resolution images, we use the generated images to train a standard CNN architecture for image classification and compare the model's performance with that of a model trained using all real examples. 
The gap in model accuracy between training on real and generated images reflects generation quality, as the ultimate objective of the perceptual autoencoder is to generate high-quality training data in incremental learning scenarios. Additionally, we conduct experiments in non-incremental learning settings to better discern the individual contributions of each component design in the perceptual autoencoder.\u003c/p\u003e \u003cdiv id=\"Sec13\" class=\"Section3\"\u003e \u003ch2\u003e4.2.1. Image Quality and Accuracy\u003c/h2\u003e \u003cp\u003eThis experiment explores how the choice of convolutional neural network (CNN) architecture for the discriminator affects image generation in a perceptual autoencoder. We investigate this in two parts: 1) Sensitivity to Spatial Variation: We test how different CNN architectures respond to variations in the spatial information of images. Architectures more sensitive to these variations are expected to produce feature maps containing richer details. These detailed feature maps, in turn, can guide the perceptual autoencoder towards generating sharper images during reconstruction. 2) Effect on Image Quality: We verify the connection between a discriminator's sensitivity and image quality. We hypothesize that using a CNN architecture more sensitive to spatial variations will lead to the perceptual autoencoder generating clearer images. This is because a less sensitive discriminator is less likely to penalize slight reconstruction errors that do not significantly alter the overall image structure.\u003c/p\u003e \u003cp\u003eTo test these hypotheses, we compare the performance of three distinct CNN architectures as the discriminator: AlexNet (Krizhevsky et al. \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2012\u003c/span\u003e), VGG-16 (Simonyan and Zisserman \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2015\u003c/span\u003e) and ResNet-50 (He et al. 
\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2016\u003c/span\u003e). These architectures offer varying levels of sensitivity to spatial variations, allowing us to observe the impact on the perceptual autoencoder's image generation capabilities.\u003c/p\u003e \u003cp\u003eIn the spatial variation sensitivity test, we introduce two simultaneous and random steps to the test images: translation and rotation. For translation, we apply a random vertical and horizontal translation of 20% of the total number of pixels. Regarding rotations, a random rotation within the range of 0 to 360 degrees is applied to the test images. Each test image has a 50% chance of undergoing these transformations. To maintain consistency, we utilize pretrained models from PyTorch (Paszke et al. \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Since all PyTorch pretrained models are trained on the ImageNet, this experiment employs ImageNet test images as the benchmark dataset. We compute accuracies three times and report the average accuracies. Additionally, a control group is included, tested using unprocessed test data, and their accuracies serve as reference baselines.\u003c/p\u003e \u003cp\u003eIn the second part of the experiment, we employ the three CNN architectures as the discriminator for the perceptual autoencoder. Image generation experiments are conducted using the higher resolution ImageNet dataset for easier visual inspections. The perceptual autoencoder is trained for three epochs using the Adam optimizer with a learning rate of 0.001. Feature maps from the last convolutional block are fed to the mean squared error (MSE) loss function for all models.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e supports our hypothesis about spatial variation sensitivity in CNNs. 
As expected, ResNet-50 exhibits the least performance degradation when applying rotations and translations to the test images. This suggests that ResNet-50 is less sensitive to these spatial variations compared to other models. Conversely, VGG-16 shows the most significant drop in accuracy, indicating a higher sensitivity to spatial variations. Interestingly, it's also worth noting that VGG-16 generated the sharpest images during the experiment. These observations align perfectly with our initial hypothesis in section 3.1: CNN architectures more sensitive to spatial variations tend to capture richer details in their feature maps, potentially leading to sharper image reconstructions in the perceptual autoencoder.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eClassification accuracies of different CNN models reported on ImageNet-1K (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e) dataset.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCNN Model\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlexNet (Krizhevsky et al. 
\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2012\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e62.50\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVGG-16 (Simonyan and Zisserman \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2015\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e76.30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eResNet-50 (He et al. \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2016\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e79.26\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eInterestingly, despite ResNet-50 achieving the highest accuracy in image classification among the three architectures, it surprisingly produces blurry images. The quality of image generation does not align with the reported model prediction accuracy shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. This discrepancy highlights that a model\u0026rsquo;s accuracy does not necessarily correlate with the image generation quality of a perceptual autoencoder.\u003c/p\u003e \u003cp\u003eThe underlying reason for this phenomenon lies in the spatial variation sensitivity within the CNN architecture, which directly impacts the spatial information captured in the discriminator feature maps. Architectures like ResNet-50, which are robust in handling spatial variations, do not yield significantly different feature maps when an image is slightly translated or rotated. Consequently, they perform well in image classification tasks. 
However, this lack of diverse feature maps poses a challenge for autoencoders, as they require precise spatial information for image features.\u003c/p\u003e \u003cp\u003eIn the absence of supervisory signals related to high-precision spatial information, perceptual autoencoders behave similarly to standard autoencoders, resulting in blurry image generation. Surprisingly, VGG-16, despite having lower classification performance, outperforms ResNet-50 in image generation. This superiority can be attributed to VGG-16\u0026rsquo;s heightened sensitivity to spatial variations, allowing it to provide exact spatial information for small image features. Notably, prior research (Srivastava and Grill-Spector \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) also reports similar findings, demonstrating that ResNet architectures are less affected by spatial variations compared to VGG-16. Ultimately, the critical factor influencing perceptual autoencoder lies in the spatial information supplied by the discriminator, rather than the model\u0026rsquo;s classification accuracy.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section3\"\u003e \u003ch2\u003e4.2.2. Size Compression\u003c/h2\u003e \u003cp\u003eIn this study, the effect of compression rate on image generation quality is investigated. Specifically, we explore the effect of varying bottleneck latent feature map sizes on the quality of generated images. Our approach involves training a perceptual autoencoder using VGG-16 as the discriminator. The feature maps from the last convolutional block serve as input to the mean squared error (MSE) loss function. To optimize the model, we employ the Adam optimizer with a learning rate of 0.001. The perceptual autoencoder is trained separately for two image datasets: ImageNet and CIFAR-100. We train it for 5 epochs on ImageNet and 50 epochs on CIFAR-100. As a measure of compression, we experiment with different latent feature map sizes. 
These sizes act as proxies for the level of compression applied to the images.\u003c/p\u003e \u003cp\u003eOur experiments involve two distinct image resolutions: CIFAR-100 (32 x 32 x 3) and ImageNet (224 x 224 x 3). The compression rate is quantified in terms of the number of pixels affected by the latent feature map size.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eCompression rates for different image resolutions and its effect on image classification task training data quality.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eImage Resolution\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of Classes\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCompression Rate\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eOriginal Image Format\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e 
\u003cp\u003eAccuracy (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eImageNet-10 (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e224 x 224 x 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e12x\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eJPEG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e75.10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eImageNet-10 (Fei-Fei et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2010\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e224 x 224 x 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e6x\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eJPEG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e78.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCIFAR-100 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e32 x 32 x 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c4\"\u003e \u003cp\u003e3x\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePNG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e71.10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCIFAR-100 (Krizhevsky \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2009\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e32 x 32 x 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2x\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePNG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e75.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eWe explore how image compression within the perceptual autoencoder affects the performance of a downstream image classifier (ResNet-18 in this case). We use the standard ImageNet dataset with images of resolution 224 x 224 x 3. The results in Table\u0026nbsp;8 show a trade-off between compression ratio and classification accuracy. Compressing images by 12x (compared to 6x) leads to a 2.9% drop in classification performance. This is because the bottleneck feature map between the encoder and decoder acts as a compressed representation of the image. Reducing its size limits the amount of information it can retain for reconstruction. As seen in Table\u0026nbsp;8, significant performance degradation in the image classifier occurs when the compression ratio reaches 3x (reduction to a third of the original size). 
This suggests that the perceptual autoencoder can effectively compress ImageNet images by up to 2x without sacrificing significant classification accuracy. The effect of compression on image quality also depends on the dataset's original resolution. In datasets like CIFAR-100 with low-resolution images (32 x 32 x 3), each pixel value carries more weight for classification, so even minor distortions caused by compression can reduce accuracy more than they would on higher-resolution datasets. Fortunately, real-world image classification tasks rarely involve such low resolutions. The 6x compression rate achieved on ImageNet offers a more realistic scenario. Compressing a 100 GB dataset by a factor of 6 reduces its size to a more manageable 16.7 GB or so. With some tolerable performance loss, further compression to about 8.3 GB (12x) might be possible. This approach not only addresses storage concerns in lifelong learning but also enables efficient dataset transfer over wireless networks.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e4.3. Exemplar selection\u0026thinsp;+\u0026thinsp;perceptual autoencoder\u003c/h2\u003e \u003cdiv id=\"Sec16\" class=\"Section3\"\u003e \u003ch2\u003e4.3.1. Experiment setup for CIFAR-100\u003c/h2\u003e \u003cp\u003eIn this section, we primarily evaluate our proposed strategy against the approaches discussed by Li and Hoiem (\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2018\u003c/span\u003e), using similar experimental settings on the CIFAR-100 dataset. For these experiments, we train the base model using the first 10 classes of CIFAR-100, and then add the remaining classes incrementally in uniform steps. We employ the ResNet architecture as our classifier to ensure consistency in comparison with Li and Hoiem\u0026rsquo;s work. 
Because CIFAR-100 contains 100 classes and we add 10 classes per step, the experiment spans a long sequence of incremental steps, which lets us assess the long-term effectiveness of our proposed incremental learning approach. Many methods experience performance degradation when dealing with long sequences of incremental learning tasks. We also report the accuracy of a baseline model trained using all available data, which serves as an upper bound for all incremental learning methods.\u003c/p\u003e \u003cp\u003eWe begin recording classification accuracy once the model covers 20 classes, as no incremental learning occurs for the initial 10 classes. Our proposed methods are labelled as follows: 1) Perceptual Autoencoder, 2) Perceptual Autoencoder\u0026thinsp;+\u0026thinsp;Exemplar Selection 20%, and 3) Perceptual Autoencoder\u0026thinsp;+\u0026thinsp;Exemplar Selection 40%. The percentages (20% and 40%) in the labels give the proportion of old task examples removed by our balanced removal strategy.\u003c/p\u003e \u003cp\u003eDuring each incremental step, we record the highest accuracy achieved, save the model, and use it to initialize the model for the next incremental step. Our perceptual autoencoder employs a 2x compression rate to ensure high generation quality. It is trained for 50 epochs using the first 10 base classes. We use the Adam optimizer with a learning rate of 0.001 during the training phase. All training data, including both old and new class data, is generated by the same perceptual autoencoder before being used to train the ResNet classifier. For the classifier, we train ResNet-50 for 200 epochs using the SGD optimizer. The learning rate starts at 0.1 and is multiplied by 0.1 at epochs 60, 90, 120, and 160. 
For exemplar selection, we employ the balanced removal strategy. Each class in the dataset is filtered only once, after training for the respective incremental step has concluded. The number of samples belonging to old classes remains constant once they are filtered.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e confirms the challenge of catastrophic forgetting in incremental learning, where performance drops as new tasks are added. However, our proposed methods show significant promise in mitigating this issue. The Perceptual Autoencoder without exemplar selection stands out. As seen in Table\u0026nbsp;10 (baseline column, last row), it achieves a performance degradation of only 7.6% compared to the baseline trained with all real data. This is the closest accuracy to the baseline among all methods (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, last point of each line). Furthermore, Table\u0026nbsp;10 demonstrates that our method surpasses current best pseudo-rehearsal methods like incGAN (Li and Hoiem \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) by 8.1%. This validates our approach of maximizing pseudo-examples through image generation. It essentially applies to incremental learning the well-established deep learning principle of leveraging more training data. Unlike incGAN, which requires retraining a Generative Adversarial Network (GAN) for each new task, our method achieves competitive performance without retraining the image generation component. Even with exemplar selection settings (20% and 40% balanced removal), as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, our methods outperform incGAN, which also utilizes a high number of pseudo-examples. 
This advantage persists as the model encounters more incremental steps, demonstrating its suitability as a long-term strategy. Our data-oriented approach achieves state-of-the-art performance while minimizing the number of old task samples required. This is possible due to the effectiveness of our high-fidelity pseudo-examples and well-chosen data exemplars.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section3\"\u003e \u003ch2\u003e4.3.2. Experiment setup for ImageNet\u003c/h2\u003e \u003cp\u003eIn this experiment, we compare our approach with another method applied to the ImageNet dataset, as reported by Wu et al. (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). The method of Wu et al. (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) is a rehearsal-based approach that stores real training examples. Our proposed method, by contrast, is a pseudo-rehearsal strategy, which, to the best of our knowledge, is the first of its kind to be evaluated on the ImageNet dataset. Instead of reporting top-1 accuracy, we focus on top-5 accuracy, aligning with Wu et al. (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), who exclusively reported top-5 accuracy for ImageNet. Our proposed methods are labelled as follows: 1) Perceptual Autoencoder, 2) Perceptual Autoencoder\u0026thinsp;+\u0026thinsp;Exemplar Selection 20%, and 3) Perceptual Autoencoder\u0026thinsp;+\u0026thinsp;Exemplar Selection 40%. The percentages (20% and 40%) in the labels give the proportion of old task examples removed by our balanced removal strategy. Additionally, we include a baseline accuracy obtained by training the model using all available training data. 
This baseline accuracy serves as an upper boundary for performance comparison.\u003c/p\u003e \u003cp\u003eIn our experiment, we use the first 100 classes of the ImageNet dataset as our benchmarking dataset. Each incremental step adds 10 classes, resulting in a gradual expansion of the training set. As in our previous experiments, we train the perceptual autoencoder using the initial 10 classes. Notably, the autoencoder is trained only once throughout the entire duration of this experiment. All training data, including both old and new class data, is generated by the same perceptual autoencoder before being used to train the ResNet classifier. The Perceptual Autoencoder is trained with the following settings in this experiment: Adam optimizer, 3 epochs, 6x compression rate, learning rate 0.001, and feature map 5 for the Mean Squared Error (MSE). As for the classifier, we employ ResNet-18 as the baseline model. All other classifier training settings mirror those described in Wu et al. (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). For exemplar selection, we use the balanced removal strategy.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e confirms the effectiveness of our proposed method. It achieves the highest accuracy among all compared methods in this lifelong learning scenario on the ImageNet dataset. The core strength of our approach lies in the high quality of image reconstructions generated by the perceptual autoencoder. Our method without exemplar selection achieves an accuracy of 90.1%, only 5.1% lower than the baseline trained with all real data. This demonstrates the effectiveness of generated data when it faithfully preserves discriminative information crucial for classification. 
With a 20% removal rate using balanced removal, our method maintains superior performance compared to existing methods like BiC (Wu et al. \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). While BiC outperforms our method with 40% removal, BiC is a rehearsal method that relies on storing real training data, which can be a storage bottleneck. Our combination of perceptual autoencoder and balanced removal consistently outperforms iCaRL across all settings. This highlights the advantage of our data-oriented approach in lifelong learning. The well-established principle in deep learning that \"more data is better\" applies even to generated data. Our method demonstrates that the perceptual autoencoder can create informative data beneficial for classification tasks. The balanced removal strategy effectively maintains model performance even with a high removal rate (40%). The gap between the baseline and the 40% removal setting is only 11.3%. This translates to a potential 40% reduction in retraining time while sacrificing roughly 11% in accuracy. This makes our method a practical choice for scenarios where high accuracy is less critical.\u003c/p\u003e \u003cp\u003eThe early saturation of performance degradation in our method suggests its suitability for large-scale, long-term incremental learning. Its performance remains stable even with a growing number of tasks. Our method, categorized as a pseudo-rehearsal method, achieves results competitive with rehearsal methods that have access to original task data. This is possible because the perceptual autoencoder generates high-fidelity images (resolution: 224 x 224 x 3) that effectively capture the necessary information for classification.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"5. 
Conclusion","content":"\u003cp\u003eThis work introduces two novel methods to address challenges in class-incremental learning: image generation and exemplar selection. The image generation method contributes in two ways. 1) Perceptual Autoencoder for Efficient Data Representation: We propose a new perceptual autoencoder that efficiently reconstructs images from a significantly reduced (6x smaller) feature map. This compressed representation minimizes distortion and avoids the need for hyperparameter tuning across different datasets. The small size of the latent feature map also helps alleviate storage concerns in incremental learning scenarios. 2) Generalizable Image Generation: Our approach enables the generation model to work effectively on various datasets and image resolutions using the same discriminator network. This demonstrates the generalizability of our strategy across different training data distributions. Furthermore, the method requires minimal hyperparameter tuning and avoids retraining, making it suitable for non-stationary training data encountered in incremental learning.\u003c/p\u003e \u003cp\u003eWe also introduce an exemplar selection method that builds upon and improves existing work on example forgetting. This method is not only applicable to incremental learning but also offers an improvement over the original approach through a balanced removal strategy. Our proposed methods achieve state-of-the-art performance on both the CIFAR-100 and ImageNet-100 datasets for image classification tasks with incremental learning. Comparisons with recent methods in the field support the competitiveness of our strategy. To the best of our knowledge, our work presents the first pseudo-rehearsal method that reports incremental learning performance using the ImageNet-100 dataset. This benchmark result can serve as a valuable baseline for future research on pseudo-rehearsal approaches in incremental learning. 
Our methods focus on improving the training data itself, rather than introducing entirely new CNN models specifically designed for incremental learning. This approach has the advantage of being applicable to any CNN architecture used for image classification, as it does not require modifications to the classification model itself. We argue that this data-oriented approach tackles a more general problem than methods that alter CNN architectures. Additionally, future advancements in image classification architectures would not render our approach obsolete.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthical Approval\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eNot Applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe authors declare that the paper is free from known competing financial interests or personal relationships that could affect the work reported.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTay Gee Yang conducted the experiments. \u0026nbsp;Swaraj Dube wrote the draft. \u0026nbsp;Hermawan Nugroho supervised and reviewed the article. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe project is supported by the Department of Electrical and Electronic Engineering, University of Nottingham Malaysia.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe datasets used in the study are available to readers. The datasets are deposited in publicly available repositories.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eFei-Fei L, Deng J, Li K (2010) ImageNet: Constructing a large-scale image database. 
J Vis 9:1037\u0026ndash;1037. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1167/9.8.1037\u003c/span\u003e\u003cspan address=\"10.1167/9.8.1037\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFolly KA (2017) Diversity increasing methods in PBIL-application to power system controller design: a comparison. Nat Comput 16. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11047-016-9544-7\u003c/span\u003e\u003cspan address=\"10.1007/s11047-016-9544-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGondara L (2016) Medical Image Denoising Using Convolutional Denoising Autoencoders. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). IEEE, pp 241\u0026ndash;246\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoodfellow I, Pouget-Abadie J, Mirza M et al (2014) Generative Adversarial Nets. Neural Information Processing Systems, NIPS 2014. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ICCVW.2019.00369\u003c/span\u003e\u003cspan address=\"10.1109/ICCVW.2019.00369\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHan J, Liu Z, Li Y, Zhang T (2023) SCMP-IL: an incremental learning method with super constraints on model parameters. Int J Mach Learn Cybernet 14. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s13042-022-01725-1\u003c/span\u003e\u003cspan address=\"10.1007/s13042-022-01725-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp 770\u0026ndash;778\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHinton G, Vinyals O, Dean J (2015) Distilling the Knowledge in a Neural Network. 1\u0026ndash;9\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJaved K, Shafait F (2019) Revisiting Distillation and Incremental Classifier Learning. pp 3\u0026ndash;17\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKemker R, McClure M, Abitino A et al (2018) Measuring catastrophic forgetting in neural networks. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 3390\u0026ndash;3398\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKirkpatrick J, Pascanu R, Rabinowitz N et al (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci U S A 114:3521\u0026ndash;3526. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1073/pnas.1611835114\u003c/span\u003e\u003cspan address=\"10.1073/pnas.1611835114\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKrizhevsky A (2009) Learning multiple layers of features from tiny images\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKrizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. 
Neural Inform Process Syst 1106\u0026ndash;1114\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi Z, Hoiem D (2018) Learning without Forgetting. IEEE Trans Pattern Anal Mach Intell 40:2935\u0026ndash;2947. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/TPAMI.2017.2773081\u003c/span\u003e\u003cspan address=\"10.1109/TPAMI.2017.2773081\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMa R, Wu Q, Ngan KN et al (2023) Forgetting to Remember: A Scalable Incremental Learning Framework for Cross-Task Blind Image Quality Assessment. IEEE Trans Multimedia 25. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/TMM.2023.3242143\u003c/span\u003e\u003cspan address=\"10.1109/TMM.2023.3242143\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMallya A, Lazebnik S (2018) PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 7765\u0026ndash;7773. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/CVPR.2018.00810\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2018.00810\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcCloskey M, Cohen NJ (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation. - Adv Res Theory 24:109\u0026ndash;165. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/S0079-7421(08)60536-8\u003c/span\u003e\u003cspan address=\"10.1016/S0079-7421(08)60536-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOnchis DM, Samuila IV (2021) Double distillation for class incremental learning. In: Proceedings \u0026ndash;\u0026thinsp;2021 23rd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2021\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eParisi GI, Kemker R, Part JL et al (2019) Continual lifelong learning with neural networks: A review. Neural Netw 113:54\u0026ndash;71. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neunet.2019.01.012\u003c/span\u003e\u003cspan address=\"10.1016/j.neunet.2019.01.012\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaszke A, Gross S, Massa F et al (2019) PyTorch: An Imperative Style. High-Performance Deep Learning Library\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRebuffi SA, Kolesnikov A, Sperl G, Lampert CH (2017) iCaRL: Incremental classifier and representation learning. Proceedings \u0026ndash;\u0026thinsp;30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-Janua:5533\u0026ndash;5542. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/CVPR.2017.587\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2017.587\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShin H, Lee JK, Kim J, Kim J (2017) Continual learning with deep generative replay. 
Adv Neural Inf Process Syst 2017\u0026ndash;Decem:2991\u0026ndash;3000\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSimonyan K, Zisserman A (2015) Very Deep Convolutional Networks For Large-Scale Image Recognition. International Conference on Learning Representations\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSolinas M, Reyboz M, Rousset S et al (2023) On the Beneficial Effects of Reinjections for Continual Learning. SN Comput Sci 4. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s42979-022-01392-7\u003c/span\u003e\u003cspan address=\"10.1007/s42979-022-01392-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSrivastava M, Grill-Spector K (2018) The Effect of Learning Strategy versus Inherent Architecture Properties on the Ability of Convolutional Neural Networks to Develop Transformation Invariance. ArXiv\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu Y, Chen Y, Wang L et al (2019) Large Scale Incremental Learning. ArXiv 374\u0026ndash;382\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXiang Y, Fu Y, Ji P, Huang H (2019) Incremental learning using conditional adversarial networks. Proceedings of the IEEE International Conference on Computer Vision 2019-Octob:6618\u0026ndash;6627. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ICCV.2019.00672\u003c/span\u003e\u003cspan address=\"10.1109/ICCV.2019.00672\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang Y, Wu QMJ, Wang Y (2018) Autoencoder With Invertible Functions for Dimension Reduction and Image Reconstruction. IEEE Trans Syst Man Cybern Syst 48:1065\u0026ndash;1079. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/TSMC.2016.2637279\u003c/span\u003e\u003cspan address=\"10.1109/TSMC.2016.2637279\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"incremental learning, Convolutional Neural Network, perceptual autoencoder, exemplar selection","lastPublishedDoi":"10.21203/rs.3.rs-4146505/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4146505/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eLifelong learning or incremental learning in convolutional neural networks (CNNs) has encountered a challenge known as catastrophic forgetting, which impairs model performance when tasks are presented sequentially. While a simple approach of retraining the model with all previously seen training data can alleviate this issue to some extent, it is not scalable due to the rapid accumulation of storage requirements and retraining time. 
To address this challenge, we propose a novel incremental learning strategy involving image data generation and exemplar selection. Specifically, we introduce a new type of autoencoder called the Perceptual Autoencoder, which reconstructs previously seen data while significantly compressing it, requiring no retraining when new classes are introduced. The latent feature map from the undercomplete Perceptual Autoencoder is stored and utilized to reconstruct training data for replay alongside new class data when necessary. Additionally, we employ example forgetting as an exemplar detection metric for exemplar selection, aiming to minimize the number of old task training data while preserving model performance. Our proposed strategy achieves state-of-the-art performance on both CIFAR-100 and ImageNet-100 datasets.\u003c/p\u003e","manuscriptTitle":"Perceptual Autoencoder and Exemplar Selection for Lifelong Learning in Convolutional Neural Networks (CNNs)","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-28 17:46:16","doi":"10.21203/rs.3.rs-4146505/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"61fc249d-9973-4097-94b3-04c12c44159c","owner":[],"postedDate":"March 28th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-04-01T10:12:06+00:00","versionOfRecord":[],"versionCreatedAt":"2024-03-28 
17:46:16","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4146505","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4146505","identity":"rs-4146505","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
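
The classifier schedule in Section 4.3.1 (learning rate 0.1, multiplied by 0.1 at epochs 60, 90, 120, and 160) and the storage estimate given for ImageNet (a 100 GB dataset at 6x and 12x compression) can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function names `step_lr` and `compressed_size_gb` are hypothetical, and it assumes each drop takes effect from the milestone epoch onward and that storage scales inversely with the compression rate.

```python
def step_lr(epoch, base_lr=0.1, milestones=(60, 90, 120, 160), gamma=0.1):
    """Learning rate at a given epoch under a step schedule.

    Assumption: the rate is multiplied by `gamma` once for every
    milestone that the epoch has reached (epoch >= milestone).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def compressed_size_gb(original_gb, compression_rate):
    """Storage needed after compression, assuming size scales as 1 / rate."""
    return original_gb / compression_rate


if __name__ == "__main__":
    # Learning rate just before and at each milestone, and at the last epoch.
    for epoch in (0, 59, 60, 90, 120, 160, 199):
        print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.6f}")

    # Storage for a 100 GB dataset at the two ImageNet compression rates.
    for rate in (6, 12):
        print(f"{rate}x: {compressed_size_gb(100, rate):.1f} GB")
```

Under these assumptions the rate falls from 0.1 to about 0.00001 over the 200 epochs, and the 100 GB dataset shrinks to roughly 16.7 GB at 6x and 8.3 GB at 12x.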
