Following a navigation instruction such as ‘Walk down the stairs and stop at the brown sofa’ requires embodied AI agents to ground the scene elements it references (e.g. ‘stairs’) to visual content in the environment (pixels corresponding to ‘stairs’). We ask the following question: can we leverage abundant ‘disembodied’ web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)?