A Wholistic View of Continual Learning
Updated: Jan 27
This post is part of a series of five in which I summarize academic papers and answer a few questions pertinent to them. The papers and related questions were given to me by my PhD committee as part of my comprehensive exam.
The paper I will be writing about in this post can be accessed using the following link.
The paper suggests a framework that bridges continual and active learning through open world awareness, with the aim of developing more robust learning systems that are able to deal with unknown data.
The authors observe that the main challenge in continual learning is framed as protecting previously acquired representations from being catastrophically forgotten due to iterative parameter updates. However, there is a bigger challenge, the authors state.
Evaluation of continual learning is plagued by the closed-world assumption: that during deployment, models will encounter data from the same distribution they were trained on. This assumption is problematic because neural networks are known to make overconfident false predictions on unknown instances and to break down when faced with corrupted data.
In this paper, the authors argue that forgotten lessons from open set recognition and active learning can be helpful in achieving a consolidated view bridging continual learning, active learning and open set recognition.
Before summarizing the approach the paper takes to form a consolidated view, let's review what the key terms continual learning, open set recognition and active learning mean.
Continual Learning / Lifelong ML:
The following desiderata are given by Chen and Liu. Continuous learning is a process that entails leveraging data as it arrives over time. Knowledge accumulation and maintenance refers to retaining knowledge obtained from encountered tasks. The ability to use past knowledge refers to using knowledge acquired from any previously learned task, irrespective of the original order, to aid future learning. In a recent iteration of the definition, Chen and Liu added two more desiderata: the ability to discover new tasks and the ability to learn on the job.
In active learning, the objective of the learning system is to query the most appropriate data to include next in order to incrementally find the best approximation to a task's solution. In continual learning, a system tries to retain information acquired at each step without endlessly accumulating data. An active learning system, complementarily, tries to find the best data for inclusion into an incrementally training system.
Open set recognition refers to the identification of statistically deviating data that lies outside the observed dataset. The goal is to separate known data from unknown unknowns.
Forgotten Lessons from these areas
Forgotten Lesson 1:
The authors note that most machine learning models are trained in a closed world setting and when these models are deployed in the real-world, they yield overconfident predictions. The authors suggest that these models should at least be equipped with a mechanism to recognize unencountered data/scenarios and warn the practitioner.
Forgotten Lesson 2:
Here, the authors argue that uncertainty sampling might not be the best method for choosing the next query in active learning. While prediction uncertainty is associated with unknown unknowns, it is not exclusive to them. Uncertainty sampling therefore carries a large chance of including meaningless outliers in the system.
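To make the concern concrete, here is a minimal sketch of plain uncertainty sampling, assuming predictive entropy as the uncertainty measure (one common choice, not necessarily the one the authors have in mind). The function names and the softmax outputs are hypothetical and only for illustration.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertainty_sample(pool_probs, k=1):
    """Return the indices of the k pool items with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical softmax outputs for three unlabeled pool items.
pool = [
    [0.90, 0.05, 0.05],  # confidently classified
    [0.50, 0.45, 0.05],  # ambiguous between two known classes
    [0.34, 0.33, 0.33],  # near-uniform, e.g. a noisy or meaningless input
]
selected = uncertainty_sample(pool, k=1)
```

In this toy pool the near-uniform item has the highest entropy, so it is queried first, even though the ambiguous item between two known classes would arguably be the more informative one to label.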
Forgotten Lesson 3:
Some methods, as a solution to identifying unknown unknowns, resort to explicit optimization on negative examples. This might not be the best approach. Firstly, the approach assumes that a huge dataset of unlabelled unknowns is readily available to include, whereas in real-world scenarios instances for continuous training become available at different times. It also sidesteps the problem of identifying unknown unknowns and instead tries to find a boundary between the known data and an existing set of unseen data, which by definition does not consist of unknown unknowns. Secondly, it is impossible to include all forms of variation and exceptions in the data upfront.
Forgotten Lesson 4:
Though data and task ordering are essential components of active learning, many modern deep continual learning works pay little attention to them, the authors note.
Joshi et al. found that active learning benefits from accounting for class imbalance, since certain complex classes require denser sampling. Similarly, Bengio et al. found that organizing the data into a curriculum, which introduces classes into training according to their difficulty, also benefits learning. Hacohen et al. found that deep neural networks inherently build such a curriculum: across architectures, they learn the same examples first when given access to the whole dataset.
Forgotten Lesson 5:
The authors point out that parameter and architecture growth should not be seen as separate solutions for challenges such as catastrophic forgetting. Instead, the authors argue, parameter and architecture growth are an integral part of the learning process. A highly parametrized network might benefit from its expansive representational capacity when encountering new data, and some algorithms might benefit from being coupled with representational expansion. On the other hand, it has been shown that in active learning it is more effective, in terms of computation and accuracy, to train smaller architectures in small sample scenarios. This observation also brings a challenge with it: it becomes difficult to attribute gains in active or continual learning to any particular technique rather than to the innate advantages of the architecture used.
Natural Interface between continual and active learning - open set recognition
The authors state that for continual and active learning, an awareness of the open world not only helps in developing robust systems but also provides the means to merge techniques into a common perspective.
Boundary between known and unknowns
The first step towards open world aware active and continual learning is training the classifying VAE and identifying the boundary between open and closed space for the observed distribution using extreme value theory (EVT). The observed aggregate posterior distribution is
An EVT-based fit can be obtained by empirically accumulating the mean latent variable for each class c over all correctly predicted known data points m = 1,...,M
and defining a respective set of latent distances as:
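As a rough illustration of these two steps, the sketch below computes a per-class mean latent vector from correctly predicted points and the set of distances of those points to their class mean. It assumes cosine distance as the distance function and takes precomputed latent means as plain lists; the helper names and the toy data are hypothetical, and the Weibull fit on the distances is not shown.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def class_means_and_distances(latents, labels, predictions):
    """Per class: mean latent of correctly predicted points, and their distances to it."""
    by_class = {}
    for z, y, y_hat in zip(latents, labels, predictions):
        if y == y_hat:  # keep only correctly predicted known data points
            by_class.setdefault(y, []).append(z)
    means = {c: mean_vector(zs) for c, zs in by_class.items()}
    distances = {
        c: [cosine_distance(means[c], z) for z in zs]
        for c, zs in by_class.items()
    }
    return means, distances

# Hypothetical 2-D latent means for five training points.
latents = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
labels = [0, 0, 1, 1, 1]
predictions = [0, 0, 1, 1, 0]  # the last point is misclassified and excluded
means, distances = class_means_and_distances(latents, labels, predictions)
```

The per-class distance sets are what an extreme value distribution (for example a Weibull) would then be fit to, one fit per class.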
Approximate Posterior Open set Recognition
The statistical outlier probability with respect to every known class is calculated as
The minimum of this value is taken across all classes c and the respective mode's parameters $p_c$. A data point is thus considered an outlier only if its outlier probability is large for each known class. The greater the dissimilarity to the observed distribution, as measured through the aggregate posterior, the closer the outlier probability gets to 1.
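The minimum-over-classes idea can be sketched as follows, assuming that the per-class fit is a two-parameter Weibull distribution on the distance to the class mean and that its parameters have already been estimated. The parameter values and distances below are made up for illustration.

```python
import math

def weibull_cdf(distance, shape, scale):
    """CDF of a two-parameter Weibull distribution evaluated at a distance >= 0."""
    return 1.0 - math.exp(-((distance / scale) ** shape))

def outlier_probability(distances_to_class_means, weibull_params):
    """Minimum over known classes of the Weibull CDF at the distance to each class mean."""
    return min(
        weibull_cdf(distances_to_class_means[c], shape, scale)
        for c, (shape, scale) in weibull_params.items()
    )

# Hypothetical per-class (shape, scale) parameters from an earlier fit.
params = {0: (2.0, 0.1), 1: (2.0, 0.2)}

# A point close to the mean of class 0, and a point far from both class means.
near_class_0 = outlier_probability({0: 0.02, 1: 0.9}, params)
far_from_all = outlier_probability({0: 0.8, 1: 0.9}, params)
```

Because of the minimum, the first point gets a low outlier probability as soon as it is close to one class, while the second only approaches 1 because it is far from every class.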
Outlier and redundancy aware active queries
The next query for active learning can be based on the statistical outlier probability. However, sampling only instances with a very high outlier probability would not protect active learning models from noisy and uninformative data. The solution proposed by the authors is to sample a variety of data across the middle of the outlier probability CDF, between 0.5 and 0.95.
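A minimal sketch of such a query rule is shown below. It assumes the outlier probabilities of the unlabeled pool are already computed, and it spreads the picks evenly across the band, which is just one simple way to obtain "a variety" of data. The function name and example values are hypothetical.

```python
def select_queries(outlier_probs, k, low=0.5, high=0.95):
    """Pick up to k pool indices with outlier probability in [low, high],
    spread evenly across that band."""
    in_band = sorted(
        (i for i, p in enumerate(outlier_probs) if low <= p <= high),
        key=lambda i: outlier_probs[i],
    )
    if len(in_band) <= k:
        return in_band
    if k == 1:
        return [in_band[len(in_band) // 2]]
    step = (len(in_band) - 1) / (k - 1)
    return [in_band[round(j * step)] for j in range(k)]

# Hypothetical outlier probabilities for eight unlabeled pool items.
pool_probs = [0.10, 0.55, 0.62, 0.70, 0.80, 0.93, 0.99, 0.999]
queries = select_queries(pool_probs, k=3)
```

Items with probabilities above 0.95 are treated as likely noise and skipped, and items below 0.5 are treated as already well covered by the observed distribution.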
Core set selection for continual learning rehearsal
Whereas in active learning the goal is to query the next suitable instance to learn, in continual learning the goal is to protect previously acquired knowledge while learning a new predetermined task.
The core set can be obtained by picking the data points that are closest to the sampled cosine distance values (if scalar) or to the sampled latent vector (if dimensionality is preserved). We can either inverse sample from the outlier probability CDF or sample directly from the aggregate posterior. However, inverse sampling from the outlier probability CDF provides more robustness, especially once the system has finished learning and is deployed, because the outlier probabilities can be limited, as the authors suggest, to p < 0.95.
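The scalar-distance variant of this selection could look roughly like the sketch below. It assumes a two-parameter Weibull fit per class, draws a probability uniformly below the 0.95 cap, maps it back to a distance with the inverse CDF, and keeps the not-yet-chosen data point whose distance is closest. All names, parameters and distances are hypothetical.

```python
import math
import random

def inverse_weibull_cdf(u, shape, scale):
    """Distance at which a two-parameter Weibull CDF equals u, for 0 <= u < 1."""
    return scale * math.log(1.0 / (1.0 - u)) ** (1.0 / shape)

def select_core_set(distances, shape, scale, k, max_prob=0.95, seed=0):
    """Pick up to k indices whose distance to the class mean is closest to
    distances inverse-sampled from the Weibull CDF, capped at max_prob."""
    rng = random.Random(seed)
    remaining = set(range(len(distances)))
    core = []
    while remaining and len(core) < k:
        target = inverse_weibull_cdf(rng.uniform(0.0, max_prob), shape, scale)
        nearest = min(remaining, key=lambda i: abs(distances[i] - target))
        core.append(nearest)
        remaining.remove(nearest)
    return core

# Hypothetical cosine distances of one class's data points to the class mean.
class_distances = [0.01, 0.03, 0.05, 0.08, 0.12, 0.20, 0.45]
core = select_core_set(class_distances, shape=2.0, scale=0.1, k=3)
```

Capping the sampled probability keeps extreme distances out of the rehearsal set, which is the robustness argument made above.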
Class Incremental Curricula and Task Order
The authors make a suggestion on how to select task order instead of executing completely arbitrary class incremental evaluation.
The suggestion is to take any candidate task t and, with the help of the outlier probability, determine its similarity to the observed distribution. Then select the task with the least overlap (or the most overlap) with the previous tasks. The equation to select the next task t is
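In code, the selection rule could be sketched as below. It assumes that each candidate task is summarized by the mean outlier probability of its data points under the current model, which is an aggregation chosen here for illustration. The task names and values are hypothetical.

```python
def select_next_task(task_outlier_probs, prefer="least_overlap"):
    """Choose the next task from per-task lists of outlier probabilities.

    A high mean outlier probability means little overlap with the
    distribution observed so far, a low mean means a lot of overlap.
    """
    scores = {t: sum(p) / len(p) for t, p in task_outlier_probs.items()}
    chooser = max if prefer == "least_overlap" else min
    return chooser(scores, key=scores.get)

# Hypothetical outlier probabilities of each remaining task's data points.
candidates = {
    "task_a": [0.20, 0.30, 0.25],
    "task_b": [0.90, 0.85, 0.95],
    "task_c": [0.50, 0.60, 0.55],
}
most_dissimilar = select_next_task(candidates, prefer="least_overlap")
most_similar = select_next_task(candidates, prefer="most_overlap")
```

Choosing the least overlapping task first pushes the model towards novel data early, while choosing the most overlapping task yields a gentler curriculum.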
The last section of the paper deals with empirically testing the framework the authors have presented to combine the benefits of active, continual and open world recognition.
The paper explains the concepts it introduces extensively. Its framework, which bridges continual and active learning through open world awareness, will help in building more robust learning models that do not break down as easily when confronted with unseen data or when deployed in real-world settings.
The paper does not go into detail about how continual learning systems can handle knowledge accumulation and the maintenance of a knowledge base to help with future learning.
Are there any topics in this paper relevant to your research? How?
The paper covers topics such as continual learning, active learning and open set recognition.
In my research these concepts are helpful in developing the baby agent learning model. Since our research philosophy has been to let the baby agent 'grow' in the SEDRo environment, it is essential that data is presented to it over time with increasing complexity, in accordance with the developing capabilities of the agent.
Consequently, the baby agent model will need to leverage tasks that present themselves over time while not forgetting knowledge acquired from previous tasks. Additionally, there needs to be a mechanism in place to maintain this learning database so that the knowledge can be used to aid future learning. The baby agent needs to be able to identify new tasks to learn as it explores the open-ended environment of SEDRo. It follows that there is an obvious need for curriculum learning, with an emphasis on task ordering. The research plan is to design SEDRo to automatically unlock task categories according to a curriculum based on cognitive development theories and milestones from developmental psychology. In this scenario, the authors' suggestion to sample the next task based on the CDF of the outlier probabilities of tasks, from the category of unlocked tasks, may be useful.
In our research, the closed world assumption currently holds, as we plan to develop a baby agent model that will grow in the SEDRo environment. However, if the plan in the future is to make the model robust enough for open world deployment, focusing on open world awareness, as suggested in the paper, will help by ensuring that the model knows the boundary between known and unknown tasks.
Legg, Shane and Marcus Hutter. “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines 17 (2007): 391-444.
Law, James et al. “A psychology based approach for longitudinal development in cognitive robotics.” Frontiers in Neurorobotics 8 (2014): n. pag.
Y. Wu, W. Ji, X. Li, G. Wang, J. Yin and F. Wu, "Context-Aware Deep Spatiotemporal Network for Hand Pose Estimation From Depth Images," in IEEE Transactions on Cybernetics, vol. 50, no. 2, pp. 787-797, Feb. 2020, doi: 10.1109/TCYB.2018.2873733.
C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
D. Pathak, R. B. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017.
X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, and Y. Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In CVPR, 2018.
S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In CVPR, 2018.
A. J. Joshi, F. Porikli, and N. Papanikolopoulos, “Multi-class active learning for image classification,” Computer Vision and Pattern Recognition (CVPR), 2009.
Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” International Conference on Machine Learning (ICML), 2009.
G. Hacohen, L. Choshen, and D. Weinshall, “Let’s Agree to Agree: Neural Networks Share Classification Order on Real Datasets,” International Conference on Learning Representations (ICLR), 2020.
Lee, M. H. 2020. How to Grow a Robot: Developing Human-Friendly, Social AI, 1–10
Baldi, Pierre and Laurent Itti. “Of bits and wows: A Bayesian theory of surprise with applications to attention.” Neural networks : the official journal of the International Neural Network Society 23 5 (2010): 649-66
Gemici, Mevlana, et al. "Generative temporal models with memory." arXiv preprint arXiv:1702.04649 (2017)
Piloto, Luis, et al. "Probing physics knowledge using tools from developmental psychology." arXiv preprint arXiv:1804.01128 (2018)