End-to-End Navigation with VLMs: Transforming Spatial Reasoning into Question-Answering

¹University of California, Berkeley, ²University of Pennsylvania

The full prompt to transform a VLM into an embodied agent consists of three parts: a system prompt describing the embodiment; an action prompt describing the task, the potential actions, and the output instructions; and an image prompt showing the current observation annotated with the proposed actions
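
For concreteness, the three parts can be assembled roughly as follows. This is an illustrative sketch only; the wording and helper names (SYSTEM_PROMPT, build_action_prompt) are placeholders rather than the exact prompts used by VLMnav.

    # Illustrative sketch of the three-part prompt structure (hypothetical
    # wording and helper names, not the exact prompts used in VLMnav).

    SYSTEM_PROMPT = (
        "You are a wheeled robot navigating an indoor environment. "
        "You can rotate in place and move along the direction you are facing."
    )

    def build_action_prompt(goal: str, num_actions: int) -> str:
        # Describes the task, the candidate actions, and the expected output format.
        return (
            f"Your goal is to find the {goal}. The image is annotated with "
            f"{num_actions} numbered arrows, each a possible action. "
            "Reply with the number of the action you choose, and nothing else."
        )

    # The image prompt is the current RGB observation with the candidate
    # actions drawn on top of it (see the Projection component below).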

Abstract

We present VLMnav, an embodied framework to transform a Vision and Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task.

We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions.

Video

Example Trajectories

Failure Cases

Overview

We present VLMnav, a navigation system that takes as input a goal G, specified either in language or as an image, together with an RGB-D image and the agent's pose, and outputs an action a. The action space consists of a rotation about the yaw axis and a displacement along the frontal axis in the robot frame, which allows every action to be expressed in polar coordinates. Since VLMs are known to struggle with reasoning about continuous coordinates, we instead transform the navigation problem into the selection of an action from a discrete set of options. Our core idea is to choose these action options in a way that avoids obstacle collisions and promotes exploration.
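
As a minimal sketch of this action parameterization (the class and method below are our own illustration, not code from VLMnav), each candidate action is a yaw rotation theta followed by a forward displacement r:

    import math
    from dataclasses import dataclass

    @dataclass
    class PolarAction:
        """A candidate action in the robot frame: rotate by theta, then move r forward."""
        theta: float  # rotation about the yaw axis, in radians (positive = turn left)
        r: float      # displacement along the frontal axis, in meters

        def to_pose_delta(self, yaw: float) -> tuple[float, float, float]:
            # World-frame (dx, dy, new_yaw) if the action is executed from heading `yaw`.
            new_yaw = yaw + self.theta
            return self.r * math.cos(new_yaw), self.r * math.sin(new_yaw), new_yaw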


Our method is made up of four key components: (i) Navigability, (ii) Action Proposer, (iii) Projection, and (iv) Prompting. An example map update shows newly observed area being marked as explored (gray) or unexplored (green)

We start by determining the navigability of the local region, estimating the distance to obstacles from a depth image. As in other works, we use the depth image and pose information to maintain a top-down voxel map of the scene, and we mark voxels as explored or unexplored. The action proposer module uses this map to determine a set of actions that avoid obstacles and promote exploration. The projection component then projects this set of possible actions onto the first-person RGB image. Finally, the VLM takes this annotated image and a carefully crafted prompt as input and selects an action, which the agent executes. To successfully complete a navigation goal, the agent must call a special STOP action within D meters of the goal object; we use a separate VLM call and prompt to decide when to trigger this termination action. In all our experiments, we use Gemini Flash as our VLM.
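
A simplified sketch of the map-maintenance step is given below, assuming a pinhole camera model and a z-up world frame; the function and variable names are our own, not those of the released implementation.

    import numpy as np

    def update_explored_map(depth, K, cam_to_world, explored, cell_size=0.1):
        """Mark top-down grid cells observed in this depth image as explored.

        depth:        (H, W) depth image in meters
        K:            (3, 3) camera intrinsic matrix
        cam_to_world: (4, 4) camera pose
        explored:     (N, N) boolean top-down grid, centered at the world origin
        """
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.ravel()
        valid = z > 0

        # Back-project pixels to camera-frame points, then transform to the world frame.
        rays = np.linalg.inv(K) @ np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
        pts_cam = rays * z                              # scale each ray by its depth
        pts_hom = np.vstack([pts_cam, np.ones(h * w)])  # homogeneous coordinates
        pts_world = (cam_to_world @ pts_hom)[:3, valid]

        # Rasterize (x, y) into grid indices; the grid is centered at the origin.
        n = explored.shape[0]
        ix = np.clip((pts_world[0] / cell_size).astype(int) + n // 2, 0, n - 1)
        iy = np.clip((pts_world[1] / cell_size).astype(int) + n // 2, 0, n - 1)
        explored[iy, ix] = True

The same projected points can be thresholded by height to mark obstacle cells, which is what the action proposer consults to keep candidate actions collision-free.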

What is the model thinking?

Baseline: PIVOT

We evaluate PIVOT (Google, 2024), a similar prompting framework for robotic tasks. At each step, an isotropic Gaussian action distribution is iteratively refit to the subset of candidate actions chosen by the VLM; the VLM then selects the action to execute from the final distribution.
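
The iterative refitting can be summarized with the following schematic (our own sketch of the procedure described above, with the VLM call stubbed out by a placeholder vlm_select function):

    import numpy as np

    def pivot_step(vlm_select, num_candidates=8, iters=3, mean=(0.0, 0.0), std=1.0):
        """Schematic of PIVOT-style action selection (not the official implementation).

        An isotropic Gaussian over 2D actions is iteratively refit to the subset of
        sampled candidates the VLM keeps; `vlm_select(candidates)` is a placeholder
        that returns the indices of the preferred candidates.
        """
        mean, std = np.asarray(mean, dtype=float), float(std)
        for _ in range(iters):
            candidates = mean + std * np.random.randn(num_candidates, 2)
            chosen = candidates[vlm_select(candidates)]   # VLM keeps a subset
            mean = chosen.mean(axis=0)                    # refit the Gaussian mean
            std = max(chosen.std(), 1e-3)                 # isotropic spread
        final = mean + std * np.random.randn(num_candidates, 2)
        return final[vlm_select(final)][0]                # single action to execute

In the real system vlm_select is a VLM queried on the annotated image; for a quick test one could pass, e.g., lambda c: np.argsort(np.linalg.norm(c, axis=1))[:3], which keeps the three candidates closest to the origin.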


Example 1: Bed. The agent completes the goal successfully, but spends a lot of time turning around


Example 2: Toilet. The agent fails and gets stuck in the corner. The action distribution struggles to represent the multi-modality of the action space

Baseline: w/o nav

To evaluate the direct impact of our prompting method, we run a baseline without the navigability and action proposer modules. This agent is presented with a static set of evenly spaced actions, which do not take into account navigability or exploration.
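
Such a static action set can be generated along the following lines (an illustrative sketch; the number of actions, field of view, and step length are assumed values, not the exact configuration used):

    import numpy as np

    def static_action_set(num_actions=8, fov_deg=120.0, step_m=1.0):
        """Evenly spaced candidate actions across the field of view (illustrative values).

        Unlike the full method, the headings and step length are fixed and ignore
        both obstacles and the explored/unexplored map.
        """
        thetas = np.deg2rad(np.linspace(-fov_deg / 2, fov_deg / 2, num_actions))
        return [(float(t), step_m) for t in thetas]   # (rotation, forward distance) pairs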


Example 1: Bed. The agent fails to complete the goal, as it gets stuck behind a chair and keeps trying to move through it


Example 2: Toilet. The agent completes the goal successfully but spends several steps stuck in the corner of the bedroom

Results

We evaluate our approach on two common navigation benchmarks, ObjectNav and GOATBench. In addition to PIVOT and w/o nav, we run a prompt only baseline, which sees a textual description of the actions but no visual annotations. We report Success Rate (SR), which measures how often the goal is completed, and SPL, which measures path efficiency. As shown below, our method outperforms all baseline prompting methods on both benchmarks. We note, however, that when compared on even ground with other state-of-the-art work, our performance remains inferior.
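
For reference, SR is the fraction of episodes in which the agent calls STOP within the success radius of the goal, and SPL follows the standard success-weighted-by-path-length definition: with S_i the binary success indicator, \ell_i the shortest-path distance to the goal, and p_i the length of the path the agent actually took on episode i,

    \mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)}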

ObjectNav
Method        SR      SPL
Ours          50.4%   0.210
w/o nav       33.2%   0.136
prompt only   29.8%   0.107
PIVOT         24.6%   0.106

GOAT
Method        SR      SPL
Ours          16.3%   0.066
w/o nav       11.8%   0.054
prompt only   11.3%   0.037
PIVOT          8.3%   0.038

BibTeX

@inproceedings{goetting2024endtoend,
  title={End-to-End Navigation with VLMs: Transforming Spatial Reasoning into Question-Answering},
  author={Dylan Goetting and Himanshu Gaurav Singh and Antonio Loquercio},
  booktitle={Workshop on Language and Robot Learning: Language as an Interface},
  year={2024},
}