Vision-Language-Action Model

OpenVLA: An Open-Source Vision-Language-Action Model

📅 Date:

✍️ Authors: Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti

🔖 Topics: Vision-Language-Action Model, Large Language Model, Industrial Robot

🏢 Organizations: Stanford University


Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
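As a rough illustration of the low-rank adaptation idea the abstract mentions, the sketch below compares how many parameters are trained for a single linear layer under full fine-tuning versus a low-rank update, and shows the adapted forward pass y = Wx + B(Ax). This is not OpenVLA's code: the function names, the layer sizes, and the rank are all assumptions chosen for illustration, and the pure-Python matrix helpers stand in for what a real setup would do in PyTorch.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Full fine-tuning updates the whole d_out x d_in weight matrix.
    A low-rank update trains A (rank x d_in) and B (d_out x rank) instead.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W is the frozen pretrained weight; only A and B would receive gradients.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Hypothetical sizes for illustration only, not taken from the paper.
    d_in, d_out, rank = 4096, 4096, 32
    full, lora = lora_param_counts(d_in, d_out, rank)
    print(f"full: {full:,}  low-rank: {lora:,}  ratio: {lora / full:.4f}")
```

With these assumed sizes the low-rank update trains 262,144 parameters instead of 16,777,216 for that layer, about 1.6% of the original count, which is the kind of saving that makes fine-tuning on a single consumer GPU plausible.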

Read more at arXiv

LINGO-2: Driving with Natural Language

📅 Date:

🔖 Topics: Autonomous Vehicle, Vision-Language-Action Model

🏢 Organizations: Wayve


This blog introduces LINGO-2, a driving model that links vision, language, and action to explain and determine driving behavior, opening up a new dimension of control and customization for an autonomous driving experience. LINGO-2 is the first closed-loop vision-language-action driving model (VLAM) tested on public roads.

Our previous model, LINGO-1, was an open-loop driving commentator that leveraged vision-language inputs to perform visual question answering (VQA) and driving commentary on tasks such as describing scene understanding, reasoning, and attention, providing only language as an output. This research model was an important first step in using language to understand what the model comprehends about the driving scene. LINGO-2 takes that one step further, providing visibility into the decision-making process of a driving model. LINGO-2 takes vision and language as inputs and outputs both driving action and language, providing a continuous commentary on its motion planning decisions. LINGO-2 adapts its actions and explanations in accordance with various scene elements and is a strong first indication of the alignment between explanations and decision-making. By linking language and action directly, LINGO-2 sheds light on how AI systems make decisions and opens up a new level of control and customization for driving.

Read more at Wayve Blog

🧠🦾 Google’s Robotic Transformer 2: More Than Meets the Eye

📅 Date:

✍️ Author: Michael Levanduski

🔖 Topics: Transformer Net, Machine Vision, Vision-language-action Model

🏢 Organizations: Google


Google DeepMind’s Robotic Transformer 2 (RT2) is an evolution of vision-language model (VLM) software. Trained on images from the web, RT2 also uses robotics datasets to manage low-level robot control. Traditionally, VLMs have been used to combine inputs from both visual and natural-language text datasets to accomplish more complex tasks. Of course, ChatGPT is at the front of this trend.

Google researchers identified a gap in how current VLMs were being applied in the robotic space. They note that current methods and approaches tend to focus on high-level robotic theory such as strategic state machine models. This leaves a void in lower-level execution of robotic action, where the majority of control engineers do their work. Thus, Google is attempting to bring the power and benefits of VLMs down into the control engineers’ domain of robot programming.

Read more at Control Automation

🧠🦾 RT-2: New model translates vision and language into action

📅 Date:

🔖 Topics: Robot Arm, Transformer Net, Machine Vision, Vision-language-action Model

🏢 Organizations: Google


Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control.

High-capacity vision-language models (VLMs) are trained on web-scale datasets, making these systems remarkably good at recognising visual or language patterns and operating across different languages. But for robots to achieve a similar level of competency, they would need to collect robot data, first-hand, across every object, environment, task, and situation.

In our paper, we introduce Robotic Transformer 2 (RT-2), a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control, while retaining web-scale capabilities.

Read more at DeepMind Blog