[Pycon] [new paper] "Vedika Agarwal" - Towards Interpretable Visual Question Answering via Compositional Reasoning

info at pycon.it
Mon 7 Jan 2019 17:03:33 CET


Title: Towards Interpretable Visual Question Answering via Compositional Reasoning
Duration: 60 minutes (includes Q&A)
Q&A Session: 15 minutes
Language: en
Type: Talk

Abstract: Humans have an amazing ability to perform compositional visual reasoning. What I mean by that is that we possess basic visual reasoning skills, such as identifying objects, recognizing colors, and understanding spatial relationships, and we can easily compose these basic skills to solve novel tasks. For example, to answer the question “What color is the cube to the right of the large metal sphere?”, a model must identify which sphere is the large metal one, understand what it means for an object to be to the right of another, and apply this concept spatially to the attended sphere. Within this new region of interest, the model must then find the cube and determine its color. In other words, a visual question answering (VQA) model must be capable of complex spatial reasoning over an image.
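As a purely illustrative sketch (not code from the talk), this decomposition can be written as a chain of basic "modules". Everything below, the toy scene, the module names, and the helper functions, is an assumption, loosely in the spirit of the CLEVR functional programs of Johnson et al.:

    # Toy scene: a list of detected objects with their attributes.
    scene = [
        {"shape": "sphere", "size": "large", "material": "metal",  "color": "gray", "x": 2.0},
        {"shape": "cube",   "size": "small", "material": "rubber", "color": "red",  "x": 5.0},
        {"shape": "cube",   "size": "large", "material": "rubber", "color": "blue", "x": 0.5},
    ]

    def filter_attr(objs, attr, value):
        # Basic skill: keep only objects whose attribute matches.
        return [o for o in objs if o[attr] == value]

    def relate_right(objs, anchor):
        # Basic skill: objects to the right of the anchor object.
        return [o for o in objs if o["x"] > anchor["x"]]

    def unique(objs):
        # The question presupposes exactly one referent.
        assert len(objs) == 1
        return objs[0]

    # Compose the basic skills to answer:
    # "What color is the cube to the right of the large metal sphere?"
    sphere = unique(filter_attr(filter_attr(filter_attr(
        scene, "size", "large"), "material", "metal"), "shape", "sphere"))
    cube = unique(filter_attr(relate_right(scene, sphere), "shape", "cube"))
    print(cube["color"])  # -> red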

This idea of compositional visual reasoning is something very powerful that humans can do, and we hope one day to build computer vision systems that can do it as well. Addressing these complex questions requires compositional models for visual reasoning, which bake the ideas of compositionality and basic skills directly into the heart of the model. This talk will give a high-level overview of some of these compositional models: Neural Module Networks by Hu et al., the PG+EE model by Johnson et al., and Stack-NMN by Hu et al. Towards the end, we will visualize attention maps that help us understand the compositional learning behaviour of these models, making their reasoning process more interpretable at the same time.
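To make the program-generator / execution-engine idea concrete, here is a minimal toy sketch in PyTorch. All names in it (the module classes, the execute helper, the hard-coded program) are assumptions for illustration, not the actual code of the talk or the papers: a predicted program selects small neural modules and chains them by passing attention maps from one to the next, and a full model would also feed the final attention into an answer classifier.

    import torch
    import torch.nn as nn

    class FindModule(nn.Module):
        # Produces an attention map over image features.
        def __init__(self, dim):
            super().__init__()
            self.conv = nn.Conv2d(dim, 1, kernel_size=1)
        def forward(self, feats, _att):
            return torch.sigmoid(self.conv(feats))

    class RelateModule(nn.Module):
        # Shifts the incoming attention to a spatially related region.
        def __init__(self, dim):
            super().__init__()
            self.conv = nn.Conv2d(dim + 1, 1, kernel_size=3, padding=1)
        def forward(self, feats, att):
            return torch.sigmoid(self.conv(torch.cat([feats, att], dim=1)))

    def execute(program, feats, modules):
        # "Execution engine": chain modules following the predicted layout.
        att = torch.ones(feats.size(0), 1, *feats.shape[2:])
        for name in program:
            att = modules[name](feats, att)
        return att

    dim = 64
    modules = {"find": FindModule(dim), "relate": RelateModule(dim)}
    feats = torch.randn(1, dim, 14, 14)   # CNN image features
    program = ["find", "relate", "find"]  # e.g. sphere -> right-of -> cube
    attention = execute(program, feats, modules)
    print(attention.shape)                # torch.Size([1, 1, 14, 14])

Visualizing the intermediate attention maps produced at each step of such a chain is exactly what makes the reasoning process inspectable.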

Tags: ComputerVision, Deep-Learning, natural-language-processing


More information about the Pycon mailing list