• Aishwarya Pothula

Generative Temporal Model with Memory for Learning Physics Concepts

This blog is part of a series of five posts in which I summarize academic papers and answer a few questions pertinent to them. These papers and the related questions were given to me by my PhD committee as part of my comprehensive exam.

The papers I will be writing about in this post can be accessed using the following links.

Probing Physics Knowledge Using Tools from Developmental Psychology

Generative Temporal Models with Memory


Synopsis - Generative Temporal Models with Memory

The paper proposes a Generative Temporal Model with Memory[13] to solve tasks involving complex, long-term temporal dependencies. It highlights three memory architectures that have different addressing schemes. The authors believe that memory addressing techniques play a critical role in determining data efficiency.

Synopsis - Probing Physics Knowledge using Tools from Developmental Psychology

The paper [14] introduces probe datasets for physics concepts such as object permanence, continuity, and solidity. A baseline model is applied to these datasets, and the Violation of Expectation (VOE) technique from developmental psychology is used to assess how much physics knowledge this artificial learning system has acquired. The baseline model is the one proposed in the paper "Generative Temporal Models with Memory" (GTMM).


How can you program a model to pass the developmental psychology tests? From the references, explain their method to train an agent without reward.

Model presented in "Probing Physics Knowledge using Tools from Developmental Psychology"

The paper applies the Violation of Expectation (VOE) method to artificial learning systems. VOE is a developmental psychology technique that infers whether a concept is understood by using looking time as a measure of surprise. Applying VOE to artificial learning systems is made possible by Itti and Baldi [12], who proposed a method to model surprise mathematically.


Intuition about mathematically formulating surprise

Consider Bayes' theorem,

P(M|D) = P(D|M) P(M) / P(D)

where P(M) is the prior distribution over the space of possible models M and P(M|D) is the posterior. Here, the data D can be viewed as prompting a re-evaluation of prior beliefs, transforming the prior distribution into the posterior distribution according to Bayes' theorem.

Using this intuition, Itti and Baldi formulate surprise as the distance between the prior and posterior distributions, and they use the KL divergence to measure that distance.
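To make the idea concrete, here is a minimal sketch of Bayesian surprise over a finite set of models. The function names, the two-model setup, and the likelihood values are all made up for illustration; they are not taken from Itti and Baldi's paper.

```python
import math

def posterior(prior, likelihood):
    """Bayes' theorem over a finite model set: P(M|D) = P(D|M) P(M) / P(D)."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # P(D)
    return [u / evidence for u in unnormalized]

def surprise_bits(prior, post):
    """KL(posterior || prior) in bits; zero-mass posterior terms contribute 0."""
    return sum(q * math.log2(q / p) for q, p in zip(post, prior) if q > 0)

# Hypothetical example: two candidate models with equal prior belief.
prior = [0.5, 0.5]
expected = posterior(prior, [0.5, 0.5])    # data equally likely under both models
unexpected = posterior(prior, [0.9, 0.1])  # data strongly favours the first model

print(surprise_bits(prior, expected))
print(surprise_bits(prior, unexpected))
```

When the data does not change the beliefs, the posterior equals the prior and the surprise is zero. The more the data shifts the beliefs, the larger the surprise.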

To apply VOE, the authors choose the model proposed in the GTMM paper, as it readily allows surprise to be calculated. The GTMM paper proposes three memory addressing schemes: Neural Turing Machine (NTM), combining content-based and positional addressing; Least-Recently Used (LRU) access, using content-based addressing exclusively; and Differentiable Neural Computer (DNC), combining content-based and positional addressing. However, the baseline model in the physics paper [14] uses only the LRU approach.

The baseline model is a VRNN with an LRU memory mechanism. It can be understood as an RNN with a VAE as the core computational unit. The hidden state of the VAE is determined from external memory.

The generic GTM operates on a set of observations x_{<=T} = {x_1, x_2, ..., x_T} and approximately infers a set of latent variables z_{<=T} = {z_1, z_2, ..., z_T}.

Its joint distribution factorizes over time steps into two terms: the first term is the likelihood function and the second term is the prior.
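Written out, a factorization consistent with this description is sketched below. The parameter symbol \theta and the exact conditioning sets are notation I am assuming here, so treat this as a sketch and not a verbatim copy of the equation in [13].

```latex
p_\theta(x_{\le T}, z_{\le T})
  = \prod_{t=1}^{T}
    \underbrace{p_\theta(x_t \mid z_{\le t}, x_{<t})}_{\text{likelihood}}
    \;
    \underbrace{p_\theta(z_t \mid z_{<t}, x_{<t})}_{\text{prior}}
```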

For a GTM with memory, both the prior and the posterior are conditioned on the output read from memory at the previous step.
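One way to write this, assuming Gaussian latents, is sketched below. Here \Psi_{t-1} is my assumed symbol for the memory output from the previous step, and f^{\mu}, f^{\sigma} are assumed names for the maps that produce the mean and standard deviation; the exact notation in [13] may differ.

```latex
\begin{aligned}
\text{Prior:}\quad
  p_\theta(z_t \mid z_{<t}, x_{<t})
  &= \mathcal{N}\bigl(z_t \mid f_z^{\mu}(\Psi_{t-1}),\; f_z^{\sigma}(\Psi_{t-1})\bigr) \\
\text{Posterior:}\quad
  q_\phi(z_t \mid z_{<t}, x_{\le t})
  &= \mathcal{N}\bigl(z_t \mid f_q^{\mu}(\Psi_{t-1}, x_t),\; f_q^{\sigma}(\Psi_{t-1}, x_t)\bigr)
\end{aligned}
```

The only difference between the two is that the posterior also sees the current observation x_t.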

To specify the VRNN, four maps need to be defined.

Posterior Map: It defines the approximate posterior distribution over 256 Gaussian latents, which is calculated from

prior + output from MLP which takes an image passed through a convolutional neural network + memory output from previous step + another copy of the prior

Prior Map: The prior map specifies the prior distribution over latents. It depends on

history of latent variables + memory output at previous step

Note: the memory output is the same for both the posterior and the prior maps. It is a weighted sum over three memory slots.

Observation Map: It specifies the parameters of the likelihood function as a

f(sample from posterior)

Transition Map: It is the LRU mechanism with an LSTM. It specifies how the memory and hidden state are updated at each time step.
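The memory read (a weighted sum over slots) and the least-recently-used write can be sketched roughly as follows. This is a simplified illustration and not the paper's implementation: the dot-product similarity, the decay factor, the usage bookkeeping, and all function names are assumptions of mine, and the real model produces its keys with an LSTM.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(memory, key):
    """Content-based read: weight each slot by its similarity to the key."""
    scores = [sum(k * m for k, m in zip(key, slot)) for slot in memory]
    weights = softmax(scores)
    width = len(memory[0])
    output = [sum(w * slot[i] for w, slot in zip(weights, memory))
              for i in range(width)]
    return output, weights

def lru_write(memory, usage, read_weights, item, decay=0.9):
    """Overwrite the least-used slot and refresh the usage counts (assumed scheme)."""
    slot = usage.index(min(usage))              # least-used slot so far
    new_memory = [list(row) for row in memory]  # copy, leave the input untouched
    new_memory[slot] = list(item)
    new_usage = [decay * u + w for u, w in zip(usage, read_weights)]
    new_usage[slot] += 1.0                      # the written slot is now fresh
    return new_memory, new_usage

# Hypothetical toy values: three memory slots of width two.
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
usage = [0.2, 0.1, 0.3]
output, weights = read_memory(memory, key=[1.0, 0.0])
new_memory, new_usage = lru_write(memory, usage, weights, item=[9.0, 9.0])
```

In this toy run the key matches the first slot best, so that slot gets the largest read weight, while the write lands in the second slot because it has the lowest usage.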

The model is trained with stochastic gradient descent on the variational lower bound, which sums a reconstruction term and a KL divergence term over the time steps.
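A sketch of this bound in the usual form for sequential latent variable models is given below. The symbols \theta and \phi for the generative and inference parameters are assumed notation, and the expression may differ in detail from the one printed in [13].

```latex
\mathcal{F}(\theta, \phi)
  = \sum_{t=1}^{T}
    \mathbb{E}_{q_\phi(z_{<t} \mid x_{<t})}
    \Bigl[
      \mathbb{E}_{q_\phi(z_t \mid z_{<t}, x_{\le t})}
        \bigl[ \log p_\theta(x_t \mid z_{\le t}, x_{<t}) \bigr]
      - \mathrm{KL}\bigl[
          q_\phi(z_t \mid z_{<t}, x_{\le t})
          \,\|\,
          p_\theta(z_t \mid z_{<t}, x_{<t})
        \bigr]
    \Bigr]
```

The first term rewards reconstructing each observation, and the second penalizes a posterior that strays from the prior at that time step.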

This per-step computation of the KL divergence makes the model a natural fit for VOE, since the KL at each time step can be read directly as a surprise signal.
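As a rough illustration, the per-step KL between a Gaussian posterior and a Gaussian prior has a closed form. The sketch below assumes diagonal Gaussians described by their means and standard deviations, and it assumes that summing over time steps is an acceptable aggregate for a whole sequence. The function names and the numbers are hypothetical.

```python
import math

def kl_step_bits(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) in bits between two diagonal Gaussians for one time step."""
    nats = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        nats += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return nats / math.log(2)  # convert nats to bits

def sequence_surprise(steps):
    """Sum per-step KL over (mu_q, sigma_q, mu_p, sigma_p) tuples (assumed aggregate)."""
    return sum(kl_step_bits(*step) for step in steps)

# Posterior equals prior: the observation told the model nothing new.
predictable = kl_step_bits([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# Posterior mean moved away from the prior: the observation was unexpected.
shifted = kl_step_bits([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])

print(predictable, shifted)
```

A video whose frames keep pulling the posterior away from the prior accumulates a larger total, which is the quantity compared between consistent and inconsistent probes.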

"Per time-step KL-divergences measure the number of bits of additional information needed to represent the posterior distribution relative to the prior distribution over the latent variable being used to explain the current observation." It means that if the KL divergence is zero, then the observation is fully predictable from the previous observations.

Training and Results

The model is trained on consistent examples and controls (link to my presentation on the physics concepts and datasets in the paper). The KL divergence, i.e. the surprise, is found to be higher for inconsistent probes than for consistent probes (videos displaying consistent and inconsistent probes), indicating that the network can recognize violations of these physics concepts.


  1. Legg, Shane and Marcus Hutter. “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines 17 (2007): 391-444.

  2. Law, James et al. “A psychology based approach for longitudinal development in cognitive robotics.” Frontiers in Neurorobotics 8 (2014): n. pag.

  3. Y. Wu, W. Ji, X. Li, G. Wang, J. Yin and F. Wu, "Context-Aware Deep Spatiotemporal Network for Hand Pose Estimation From Depth Images," in IEEE Transactions on Cybernetics, vol. 50, no. 2, pp. 787-797, Feb. 2020, doi: 10.1109/TCYB.2018.2873733.

  4. C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.

  5. D. Pathak, R. B. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017.

  6. X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, and Y. Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In CVPR, 2018.

  7. S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In CVPR, 2018.

  8. A. J. Joshi, F. Porikli, and N. Papanikolopoulos, “Multi-class active learning for image classification,” Computer Vision and Pattern Recognition (CVPR), 2009.

  9. Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” International Conference on Machine Learning (ICML), 2009.

  10. G. Hacohen, L. Choshen, and D. Weinshall, “Let’s Agree to Agree: Neural Networks Share Classification Order on Real Datasets,” International Conference on Learning Representations (ICLR), 2020.

  11. Lee, M. H. 2020. How to Grow a Robot: Developing Human-Friendly, Social AI, 1–10

  12. Baldi, Pierre and Laurent Itti. “Of bits and wows: A Bayesian theory of surprise with applications to attention.” Neural networks : the official journal of the International Neural Network Society 23 5 (2010): 649-66

  13. Gemici, Mevlana, et al. "Generative temporal models with memory." arXiv preprint arXiv:1702.04649 (2017)

  14. Piloto, Luis, et al. "Probing physics knowledge using tools from developmental psychology." arXiv preprint arXiv:1804.01128 (2018)
