Objective: Develop a computational theory that advances our understanding of the relationship between visual perception and spatio-temporal language.
Our Approach: We humans have the ability to understand and describe a visual scene in natural language. For example, we can say "the cat is sitting on the couch," and conversely we can understand such descriptions and visualize the meaning they convey. What mental representations and processes do we go through to transform what we see (i.e., perception) into words, and vice versa? We took a first step toward answering this broad and important question in AI and cognition by focusing on spatio-temporal descriptions such as "cat *on* couch". We developed integrated visual spatial representations and linked them to computational meaning representations, which allowed us to algorithmically go back and forth between scenes and language. We evaluated our theory and its implementation on real-world tasks such as placing an object given a natural-language directive ("put the printer on the table") and answering "where" questions ("where are my glasses?").
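To make the idea concrete, here is a minimal, hypothetical sketch (names and geometry are illustrative assumptions, not the project's actual representation) of how a spatial predicate such as *on* might be grounded in scene geometry: an object is "on" another if its base rests near the other's top surface and their horizontal extents overlap.

```python
# Illustrative sketch only: a toy grounding of the spatial predicate "on"
# over axis-aligned 2D boxes (x grows rightward, y grows upward).
from dataclasses import dataclass

@dataclass
class Box:
    x_min: float
    x_max: float
    y_min: float
    y_max: float

def on(a: Box, b: Box, tol: float = 0.1) -> bool:
    """True if box `a` rests on top of box `b`, within a tolerance."""
    horizontal_overlap = a.x_min < b.x_max and b.x_min < a.x_max
    resting = abs(a.y_min - b.y_max) <= tol
    return horizontal_overlap and resting

cat = Box(x_min=1.0, x_max=2.0, y_min=1.5, y_max=2.2)
couch = Box(x_min=0.0, x_max=3.0, y_min=0.0, y_max=1.5)
print(on(cat, couch))  # True: the cat's base sits on the couch's top surface
```

With predicates of this kind, a scene can be checked against a description ("cat on couch") and, inversely, a directive ("put the printer on the table") can be turned into a geometric placement constraint.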
On hundreds of test cases gathered from the internet, our algorithms reached roughly 60% of human performance. In our basic approach, we manually developed the visual and language representations; later, we used a machine learning approach to learn such representations and applied them to tasks the system had never seen before. Although our agents and algorithms do not yet match human performance on these tasks, this was a very encouraging first step.
Solution We Delivered: We advanced the representations and mechanisms at the intersection of agent perception and language. Our theories and proof-of-concept implementations substantially reduced the R&D risks for ONR and the US Navy toward the development of next-generation unmanned and command-and-control systems. We authored the patents and scientific publications listed below, along with three algorithm implementations.