• Aishwarya Pothula

Need for A Physically Embodied Turing Test

Updated: Jan 27

This blog is part of a series of five posts in which I summarize academic papers and answer a few questions pertinent to them. These papers and related questions were given to me by my PhD committee as part of my comprehensive exam.

The paper I will be writing about in this post can be accessed using the following link.

Why We Need a Physically Embodied Turing Test and What It Might Look Like



In this article, the author argues that the Turing Test, originally designed to be a sufficient test for intelligence, proves to be a weak test because its main focus is on only one aspect of intelligence: language and its use. The author argues that action, perception, and commonsense reasoning, along with language, are integral components of intelligence, and that a test for intelligence must be sufficient to test for this comprehensive view. Initial suggestions for such a test are proposed.


The original Turing Test focused mainly on language and its use as a measure of intelligence. The author points out that the setup of the test itself, where the human tester communicates with the human participant and the computer system without seeing them, indicates no assumption of a role for physical embodiment in intelligence.

Intelligence as a paradigm is widely accepted to comprise the use of language, perception, action, and commonsense reasoning. The author argues that the Turing Test is not comprehensive enough to test for this view of intelligence, for the following reasons.

  1. "Disembodied thought alone", which the Turing Test is built to evaluate, "cannot get one very far in the world", states the author. It is generally accepted that an agent needs to be able to perceive and act in its environment, for which embodiment is a necessity, to be considered intelligent.

  2. "From the perspective of action, the Turing test can only be used to judge descriptions of actions that one could argue were sufficiently detailed to be, in principle, executable". For example, a sentence like "Little Johnny tied his shoelace" is, without perception, extremely complex to represent as knowledge to an artificial agent.

  3. "The passing of the test by a machine would certainly justify one in announcing the arrival of human-level AI, but along the way, it can only provide a rather crude measure." The author means that the Turing Test has no means of evaluating the developmental progress of intelligence in an agent. Though tests such as the Winograd Schema Challenge attempt to solve this issue, they still do not address the embodiment aspect of intelligence.

  4. Functional individuation of objects: "our faculty of visual perception by itself, without the benefit of being able to interact with an object or reason about its behavior, runs up against its own difficulties when it attempts to recognize correctly many classes of objects". For example, recognizing even a simple object such as a hinge involves perceiving a) its similarity to other seen hinges, so as to categorize it as one, b) its nature as an interactable object with a certain physical behavior, and c) its functionality in the context in which it is present. This kind of perception necessitates the integration of perception, action, and commonsense reasoning.

Hence, to address the concerns raised so far, the author states that a test should satisfy the following criteria: "physical embodiment coupled with reasoning and communication, support for incremental development, and the existence of clear quantitative measures of progress."

The author goes on to propose initial suggestions for such a test. The proposed challenge consists of two tracks: a construction track and an exploration track.

The construction challenge requires an agent to build pre-defined structures (e.g., ready-to-assemble furniture, building blocks) with the aid of a mix of verbal and pictorial instructions. The challenge can also be extended to require collaboration with other agents (human or robotic) to complete the tasks. The tasks are made incrementally more complex to evaluate developmental progress.

The author's suggested stages of progression for this track are given below.

The exploration track focuses on an agent's ability to interact and experiment with complex structures in order to improvise upon them. The track also involves building dynamic structures such as contraptions and describing their workings in natural language. The exploration track, too, has incremental levels of complexity. The author's suggestions for these levels are given below.

The author believes that the proposed challenge satisfies the identified criteria by requiring the agent to communicate in natural language, act in both instructed and exploratory ways through embodied perception, and improvise using commonsense reasoning. Moreover, the levels of complexity in the tasks allow for the evaluation of developmental progress.


The article puts forward convincing arguments that a sufficient test for intelligence should, apart from language, also pay heed to an agent's ability to perceive and act in an environment. In our research, we strongly believe in this philosophy of evaluating intelligence.


We humans are able to apply and transfer what we learn across multiple domains such as sports, cooking, and reading. I find that the article does not explicitly mention this aspect of human intelligence as part of the view of intelligence it proposes to test for.


Discuss how the arguments and proposals made in this paper relate to the evaluations and evaluation environments you have been creating

Before designing evaluations for intelligence, we found it pertinent to have working definitions of intelligence and human-level intelligence. We defined them as follows.


"intelligence is defined as an agent's ability to achieve goals in a wide range of environments"[1]
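This quoted definition is the informal version of Legg and Hutter's universal intelligence measure [1]. Formally (in their notation, reproduced here as a reminder rather than a new result), a policy π is scored by its expected value across all computable environments μ, with simpler environments weighted more heavily; K(μ) is the Kolmogorov complexity of the environment and V_μ^π the expected value the agent achieves in it:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```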

Human-Level intelligence

"An agent has human-level artificial intelligence if there exists a sequence of symbols (a symbolic description) for every feasible experience, such that the agent can update the behavior policy equally, whether it goes through the sequence of sensory inputs and actions or it receives only the corresponding symbolic description."

In essence, the definition states that it is a human's ability to learn from others' experiences through language that distinguishes human intelligence.

In the following sections, I describe how my evaluation relates to the arguments and proposals made in the paper.

Testing intelligence as a combination of language, action, perception, and commonsense reasoning

The final test we propose for evaluating human-level intelligence in agents is called the Language Acquisition Test (LAT). In line with our definition of human-level intelligence, it checks whether an agent is able to change its behavior towards an object or an event based on receiving just a description of that object/event.

To paraphrase my professor Dr. Park's example, imagine that an agent has never encountered cola but receives a positive description from the caretaker agent: it is a dark, sparkling liquid that tastes good when you drink it. The LAT evaluates whether this description of the liquid brings about the same change in the agent's behavior towards cola as when the agent directly experiences it and likes it. By behavior, we mean whether the agent is inclined to drink cola.

Passing this kind of test requires the agent to have a grounded understanding of language, the ability to perceive cola as a drink, and the capacity to reason and take a certain action.
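The pass criterion can be sketched in code. This is only an illustrative sketch, not our actual implementation: the function names, the behavior model (a probability of approaching/consuming a stimulus), and the tolerance are all assumptions made for exposition.

```python
# Illustrative sketch of the LAT pass criterion (hypothetical names,
# not the actual SEDRo codebase). An agent's behavior towards a
# stimulus is modeled as a probability of approaching/consuming it.

def behavior_change(policy_before, policy_after, stimulus):
    """Shift in the agent's inclination towards the stimulus."""
    return policy_after(stimulus) - policy_before(stimulus)

def passes_lat(agent_direct, agent_described, stimulus, tolerance=0.1):
    """Pass if a symbolic description shifts behavior roughly as much
    as direct experience did. Each argument is a (before, after) pair
    of behavior policies for the same agent."""
    delta_direct = behavior_change(*agent_direct, stimulus)
    delta_described = behavior_change(*agent_described, stimulus)
    return abs(delta_direct - delta_described) <= tolerance
```

Under this sketch, an agent whose inclination to drink cola rises almost as much after hearing the caretaker's description as after tasting cola would pass.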

However, the tricky part about the LAT is that it is the final test: if an agent fails the LAT, we cannot tell whether the failure is due to a lack of learning experiences or because the agent does not have the capability, and if so, which specific capability.

Evaluating progress

In an attempt to address this problem and ensure that we can measure progress in development and learning, we have designed a curriculum evaluation approach that tests for developmental milestones in multiple domains, such as vision, motor, and social skills, as the agent grows in capability.

These tests in the evaluation curriculum are inspired by tests and metrics in developmental psychology. Being a field concerned with evaluating cognitive development, developmental psychology, we believe, provides the right foundation for setting developmental milestones and designing evaluations for them.

Tests in the curriculum evaluate progress and are administered based on the growing capabilities of the system rather than on a fixed time schedule. Only when the agent achieves a milestone is the next evaluation administered. Below is an excerpt from a table in Law et al., 2014 [2] that describes milestone development in infants.
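The milestone-gated scheduling just described can be sketched as follows. The milestone names and the boolean test interface are illustrative assumptions, not our actual test harness; the point is only the gating logic: later tests are withheld until the current milestone is achieved.

```python
# Illustrative sketch of milestone-gated test administration: tests are
# ordered by developmental milestone, and the next test is administered
# only after the current one is passed (capability-based, not time-based).

def run_curriculum(agent, tests):
    """tests: ordered list of (milestone_name, test_fn) pairs, where
    test_fn(agent) -> bool. Returns the milestones achieved, in order."""
    achieved = []
    for name, test_fn in tests:
        if not test_fn(agent):  # milestone not yet reached:
            break               # stop; do not administer later tests
        achieved.append(name)
    return achieved
```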

Currently, the tests that I have implemented are in the domain of vision and test for the presence of physics concepts such as unity perception, object permanence, continuity, solidity, and containment.


Because our environment, SEDRo, is limited to providing the experiences of a one-year-old infant, evaluation depends heavily on observing the non-verbal behavior exhibited by the baby agent. Behaviors such as eye gaze and reaching movements of the hands, for which embodiment is essential, play a vital role in evaluating learning in the baby agent. For example, we use the direction and duration of eye gaze to estimate surprise when determining whether the agent possesses unity perception.
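A minimal sketch of this gaze-based surprise measure, in the spirit of the violation-of-expectation paradigm from developmental psychology: longer looking at an "unexpected" display than at a matched control display is read as surprise. The sampling rate, looking-time ratio, and function names are assumptions for illustration, not our actual thresholds.

```python
# Hypothetical violation-of-expectation measure: the baby agent is shown
# an expected (control) event and an unexpected event, and surprise is
# inferred when the unexpected event holds its gaze substantially longer.

def looking_time(gaze_samples, target, hz=30):
    """Seconds of gaze directed at `target`, given per-frame gaze labels
    sampled at `hz` frames per second."""
    return sum(1 for g in gaze_samples if g == target) / hz

def shows_surprise(gaze_expected, gaze_unexpected, target, ratio=1.5):
    """Surprise inferred when looking time at the unexpected event
    exceeds looking time at the expected event by the given ratio."""
    t_exp = looking_time(gaze_expected, target)
    t_unexp = looking_time(gaze_unexpected, target)
    return t_exp > 0 and t_unexp / t_exp >= ratio
```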

Other examples of non-verbal behaviors used for evaluation are presented in this excerpt from a table in Law et al., 2014 [2].

I feel our research and evaluation are mostly in agreement with the arguments made in the article. The difference, however, is that the comprehensive view of intelligence presented in the article is evaluated only up to the level of a one-year-old infant. For example, the vocabulary and language understanding being tested, as well as the physics concepts, are at infant level.


  1. Legg, Shane and Marcus Hutter. “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines 17 (2007): 391-444.

  2. Law, James et al. “A psychology based approach for longitudinal development in cognitive robotics.” Frontiers in Neurorobotics 8 (2014): n. pag.

  3. Y. Wu, W. Ji, X. Li, G. Wang, J. Yin and F. Wu, "Context-Aware Deep Spatiotemporal Network for Hand Pose Estimation From Depth Images," in IEEE Transactions on Cybernetics, vol. 50, no. 2, pp. 787-797, Feb. 2020, doi: 10.1109/TCYB.2018.2873733.

  4. C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.

  5. D. Pathak, R. B. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017.

  6. X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, and Y. Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In CVPR, 2018.

  7. S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In CVPR, 2018.

  8. A. J. Joshi, F. Porikli, and N. Papanikolopoulos, “Multi-class active learning for image classification,” Computer Vision and Pattern Recognition (CVPR), 2009.

  9. Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” International Conference on Machine Learning (ICML), 2009.

  10. G. Hacohen, L. Choshen, and D. Weinshall, “Let’s Agree to Agree: Neural Networks Share Classification Order on Real Datasets,” International Conference on Learning Representations (ICLR), 2020.

  11. Lee, M. H. 2020. How to Grow a Robot: Developing Human-Friendly, Social AI, 1–10

  12. Baldi, Pierre and Laurent Itti. “Of bits and wows: A Bayesian theory of surprise with applications to attention.” Neural Networks 23.5 (2010): 649-666.

  13. Gemici, Mevlana, et al. "Generative temporal models with memory." arXiv preprint arXiv:1702.04649 (2017)

  14. Piloto, Luis, et al. "Probing physics knowledge using tools from developmental psychology." arXiv preprint arXiv:1804.01128 (2018)
