AI Research
RT-2: New model translates vision and language into action

Research
Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control
High-capacity vision-language models (VLMs) are trained on web-scale datasets, making these systems remarkably good at recognising visual or language patterns and operating across different languages. But for robots to achieve a similar level of competency, they would need to collect robot data, first-hand, across every object, environment, task, and situation.
In our paper, we introduce Robotic Transformer 2 (RT-2), a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control, while retaining web-scale capabilities.
A visual-language model (VLM) pre-trained on web-scale data is learning from RT-1 robotics data to become RT-2, a visual-language-action (VLA) model that can control a robot.
This work builds upon Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations, which can learn combinations of tasks and objects seen in the robotic data. More specifically, our work used RT-1 robot demonstration data that was collected with 13 robots over 17 months in an office kitchen environment.
RT-2 shows improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.
We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, like deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink).
Adapting VLMs for robotic control
RT-2 builds upon VLMs that take one or more images as input, and produces a sequence of tokens that, conventionally, represent natural language text. Such VLMs have been successfully trained on web-scale data to perform tasks, like visual question answering, image captioning, or object recognition. In our work, we adapt Pathways Language and Image model (PaLI-X) and Pathways Language model Embodied (PaLM-E) to act as the backbones of RT-2.
To control a robot, the model must be trained to output actions. We address this challenge by representing actions as tokens in the model’s output – similar to language tokens – and describe actions as strings that can be processed by standard natural language tokenizers, shown here:
Representation of an action string used in RT-2 training. An example of such a string could be a sequence of robot action token numbers, e.g.“1 128 91 241 5 101 127 217”.
The string starts with a flag that indicates whether to continue or terminate the current episode, without executing the subsequent commands, and follows with the commands to change position and rotation of the end-effector, as well as the desired extension of the robot gripper.
We use the same discretised version of robot actions as in RT-1, and show that converting it to a string representation makes it possible to train VLM models on robotic data – as the input and output spaces of such models don’t need to be changed.
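The action-string scheme above can be illustrated with a small decoding sketch. This is not the official RT-2 code; the bin count of 256 and the continuous ranges chosen here are illustrative assumptions, not values from the article.

```python
# Illustrative sketch of decoding a discretised RT-2-style action string:
# one termination flag, three position deltas, three rotation deltas, and
# one gripper value, each represented as an integer token. The bin count
# (256) and continuous ranges below are assumptions for demonstration only.

def decode_action_string(s, num_bins=256,
                         pos_range=(-0.1, 0.1), rot_range=(-3.14, 3.14)):
    tokens = [int(t) for t in s.split()]
    assert len(tokens) == 8, "flag + 3 position + 3 rotation + 1 gripper"
    terminate = tokens[0]  # whether to terminate the current episode

    def debin(tok, lo, hi):
        # map a discrete bin index back to a continuous value
        return lo + (hi - lo) * tok / (num_bins - 1)

    position = [debin(t, *pos_range) for t in tokens[1:4]]
    rotation = [debin(t, *rot_range) for t in tokens[4:7]]
    gripper = debin(tokens[7], 0.0, 1.0)  # gripper extension in [0, 1]
    return terminate, position, rotation, gripper

# Decode the example string from the caption above.
term, pos, rot, grip = decode_action_string("1 128 91 241 5 101 127 217")
```

Because both input and output are plain strings of tokens, the same VLM tokenizer and output head can be reused unchanged, which is the key point of the paragraph above.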
RT-2 architecture and training: We co-fine-tune a pre-trained VLM model on robotics and web data. The resulting model takes in robot camera images and directly predicts actions for a robot to perform.
Generalisation and emergent skills
We performed a series of qualitative and quantitative experiments on our RT-2 models, spanning over 6,000 robotic trials. Exploring RT-2’s emergent capabilities, we first searched for tasks that would require combining knowledge from web-scale data and the robot’s experience, and then defined three categories of skills: symbol understanding, reasoning, and human recognition.
Each task required understanding visual-semantic concepts and the ability to perform robotic control to operate on these concepts. Commands such as “pick up the bag about to fall off the table” or “move banana to the sum of two plus one” – where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robotic data – required knowledge translated from web-based data to operate.
Examples of emergent robotic skills that are not present in the robotics data and require knowledge transfer from web pre-training.
Across all categories, we observed increased generalisation performance (more than 3x improvement) compared to previous baselines, such as previous RT-1 models and models like Visual Cortex (VC-1), which were pre-trained on large visual datasets.
Success rates of emergent skill evaluations: our RT-2 models outperform both previous robotics transformer (RT-1) and visual pre-training (VC-1) baselines.
We also performed a series of quantitative evaluations, beginning with the original RT-1 tasks, for which we have examples in the robot data, and continuing with varying degrees of objects, backgrounds, and environments previously unseen by the robot, which required it to learn generalisation from VLM pre-training.
Examples of environments previously unseen by the robot, where RT-2 generalises to novel situations.
RT-2 retained performance on the original tasks seen in robot data and improved performance on scenarios previously unseen by the robot, from RT-1’s 32% to 62%, showing the considerable benefit of large-scale pre-training.
Additionally, we observed significant improvements over baselines pre-trained on visual-only tasks, such as VC-1 and Reusable Representations for Robotic Manipulation (R3M), and algorithms that use VLMs for object identification, such as Manipulation of Open-World Objects (MOO).
RT-2 achieves high performance on seen in-distribution tasks and outperforms multiple baselines on out-of-distribution unseen tasks.
Evaluating our model on the open-source Language Table suite of robotic tasks, we achieved a success rate of 90% in simulation, substantially improving over the previous baselines including BC-Z (72%), RT-1 (74%), and LAVA (77%).
Then we evaluated the same model in the real world (since it was trained on simulation and real data), and demonstrated its ability to generalise to novel objects, as shown below, where none of the objects except the blue cube were present in the training dataset.
RT-2 performs well on real robot Language Table tasks. None of the objects except the blue cube were present in the training data.
Inspired by chain-of-thought prompting methods used in LLMs, we probed our models to combine robotic control with chain-of-thought reasoning to enable learning long-horizon planning and low-level skills within a single model.
In particular, we fine-tuned a variant of RT-2 for just a few hundred gradient steps to increase its ability to use language and actions jointly. Then we augmented the data to include an additional “Plan” step, first describing the purpose of the action that the robot is about to take in natural language, followed by “Action” and the action tokens. Here we show an example of such reasoning and the robot’s resulting behaviour:
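The augmentation described above can be sketched as a simple string transformation on the training targets. This is a hedged illustration, not the paper's actual data pipeline; the function name and the exact "Plan:"/"Action:" formatting are assumptions for demonstration.

```python
# Illustrative sketch of the chain-of-thought data augmentation described
# above: each training target gains a natural-language "Plan" step before
# the discretised action tokens. The field labels and formatting here are
# assumptions, not the paper's exact schema.

def make_cot_target(plan_text, action_tokens):
    # Join the integer action tokens into the string form the model emits.
    action_str = " ".join(str(t) for t in action_tokens)
    return f"Plan: {plan_text} Action: {action_str}"

example = make_cot_target(
    "pick up the rock to use as an improvised hammer",
    [1, 128, 91, 241, 5, 101, 127, 217],
)
```

Training on targets of this shape lets a single model both describe its plan in language and emit the low-level action tokens that execute it.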
Chain-of-thought reasoning enables learning a self-contained model that can both plan long-horizon skill sequences and predict robot actions.
With this process, RT-2 can perform more involved commands that require reasoning about intermediate steps needed to accomplish a user instruction. Thanks to its VLM backbone, RT-2 can also plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
Advancing robotic control
RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly-improved robotic policies, and, more importantly, leads to significantly better generalisation performance and emergent capabilities, inherited from web-scale vision-language pre-training.
RT-2 is not only a simple and effective modification over existing VLM models, but also shows the promise of building a general-purpose physical robot that can reason, problem-solve, and interpret information to perform a diverse range of tasks in the real world.
Acknowledgements
We would like to thank the co-authors of this work: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu and Brianna Zitkovich for their contributions to the project and Fred Alcober, Jodi Lynn Andres, Carolina Parada, Joseph Dabis, Rochelle Dela Cruz, Jessica Gomez, Gavin Gonzalez, John Guilyard, Tomas Jackson, Jie Tan, Scott Lehrer, Dee M, Utsav Malla, Sarah Nguyen, Jane Park, Emily Perez, Elio Prado, Jornell Quiambao, Clayton Tan, Jodexty Therlonge, Eleanor Tomlinson, Wenxuan Zhou, and the greater Google DeepMind team for their help and feedback.
If I Could Only Buy 1 Artificial Intelligence (AI) Chip Stock Over The Next 10 Years, This Would Be It (Hint: It’s Not Nvidia)

While Nvidia continues to capture headlines, a critical enabler of the artificial intelligence (AI) infrastructure boom may be better positioned for long-term gains.
When investors debate the future of the artificial intelligence (AI) trade, the conversation generally finds its way back to the usual suspects: Nvidia, Advanced Micro Devices, and cloud hyperscalers like Microsoft, Amazon, and Alphabet.
Each of these companies is racing to design GPUs or develop custom accelerators in-house. But behind this hardware, there’s a company that benefits no matter which chip brand comes out ahead: Taiwan Semiconductor Manufacturing (TSMC).
Let’s unpack why Taiwan Semi is my top AI chip stock over the next 10 years, and assess whether now is an opportune time to scoop up some shares.
Agnostic to the winner, leveraged to the trend
As the world’s leading semiconductor foundry, TSMC manufactures chips for nearly every major AI developer — from Nvidia and AMD to Amazon’s custom silicon initiatives, dubbed Trainium and Inferentia.
Unlike many of its peers in the chip space that rely on new product cycles to spur demand, Taiwan Semi’s business model is fundamentally agnostic. Whether demand is allocated toward GPUs, accelerators, or specialized cloud silicon, all roads lead back to TSMC’s fabrication capabilities.
With nearly 70% market share in the global foundry space, Taiwan Semi’s dominance is hard to ignore. Such a commanding lead over the competition provides the company with unmatched structural demand visibility — a trend that appears to be accelerating as AI infrastructure spend remains on the rise.
Scaling with more sophisticated AI applications
At the moment, AI development is still concentrated on training and refining large language models (LLMs) and embedding them into downstream software applications.
The next wave of AI will expand into far more diverse and demanding use cases: autonomous systems, robotics, and quantum computing all remain in their infancy. At scale, these workloads will place greater demands on silicon than today’s chips can support.
Meeting these demands doesn’t simply require additional investments in chips. Rather, it requires chips engineered for new levels of efficiency, performance, and power management. This is where TSMC’s competitive advantages begin to compound.
With each successive generation of process technology, the company has a unique opportunity to widen the performance gap between itself and rivals like Samsung or Intel.
Since Taiwan Semi already has such a large footprint in the foundry landscape, next-generation design complexities give the company a chance to further lock in deeper, stickier customer relationships.
TSMC’s valuation and the case for expansion
Taiwan Semi may trade at a forward price-to-earnings (P/E) ratio of 24, but dismissing the stock as “expensive” overlooks the company’s extraordinary positioning in the AI realm. To me, the company’s valuation reflects a robust growth outlook, improving earnings prospects, and a declining risk premium.
Unlike many of its semiconductor peers, which are vulnerable to cyclicality headwinds, TSMC has become an indispensable utility for many of the world’s largest AI developers, evolving into one of the backbones of the ongoing infrastructure boom.
The scale of investment behind current AI infrastructure is jaw-dropping. Hyperscalers are investing staggering sums to expand and modernize data centers, and at the heart of each new buildout is an unrelenting demand for more chips. Moreover, each of these companies is exploring more advanced use cases that will, at some point, require next-generation processing capabilities.
These dynamics position Taiwan Semi at the crossroads of immediate growth and enduring long-term expansion, as AI infrastructure evolves from a constant driver of growth today into a multidecade secular theme.
TSMC’s manufacturing dominance ensures that its services will continue to witness robust demand for years to come. For this reason, I think Taiwan Semi is positioned to experience further valuation expansion over the next decade as the infrastructure chapter of the AI story continues to unfold.
While there are many great opportunities in the chip space, TSMC stands alone. I see it as perhaps the most unique, durable semiconductor stock to own amid a volatile technology landscape over the next several years.
Adam Spatacco has positions in Alphabet, Amazon, Microsoft, and Nvidia. The Motley Fool has positions in and recommends Advanced Micro Devices, Alphabet, Amazon, Intel, Microsoft, Nvidia, and Taiwan Semiconductor Manufacturing. The Motley Fool recommends the following options: long January 2026 $395 calls on Microsoft, short August 2025 $24 calls on Intel, short January 2026 $405 calls on Microsoft, and short November 2025 $21 puts on Intel. The Motley Fool has a disclosure policy.
Researchers train AI to diagnose heart failure in rural patients using low-tech electrocardiograms

Concerned about the ability of artificial intelligence models trained on data from urban demographics to make the right medical diagnoses for rural populations, West Virginia University computer scientists have developed several AI models that can identify signs of heart failure in patients from Appalachia.
Prashnna Gyawali, assistant professor in the Lane Department of Computer Science and Electrical Engineering at the WVU Benjamin M. Statler College of Engineering and Mineral Resources, said heart failure—a chronic, persistent condition in which the heart cannot pump enough blood to meet the body’s need for oxygen—is one of the most pressing national and global health issues, and one that hits rural regions of the U.S. especially hard.
Despite the outsized impact of heart failure on rural populations, AI models are currently being trained to diagnose the disease using data representing patients from urban and suburban areas like Stanford, California, Gyawali said.
“Imagine Jane Doe, a 62-year-old woman living in a rural Appalachian community,” he suggested. “She has limited access to specialty care, relies on a small local clinic, and her lifestyle, diet and health history reflect the realities of her environment: high physical labor, minimal preventive care, and increased exposure to environmental risk factors like coal dust or poor air quality. Jane begins to experience fatigue and shortness of breath—symptoms that could point to heart failure.
“An AI system, trained primarily on data from urban hospitals in more affluent, coastal areas, evaluates Jane’s lab results. But because the system was not trained on patients who share Jane’s socioeconomic and environmental context, it fails to recognize her condition as urgent or abnormal,” Gyawali said. “This is why this work matters. By training AI models on data from West Virginia patients, we aim to ensure people like Jane receive accurate diagnoses, no matter where they live or how their lives differ from national averages.”
The researchers identified the AI models that were most accurate at diagnosing heart failure in an anonymized sample of more than 55,000 patients who received medical care in West Virginia. They also pinpointed the exact parameters for providing the AI models with data that most enhanced diagnostic accuracy. The findings appear in Scientific Reports, a Nature portfolio journal.
Doctoral student Alina Devkota emphasized they trained the AI models to work from patients’ electrocardiogram results, rather than the echocardiogram readings typical for patient data from urban areas.
Electrocardiograms rely on round electrodes stuck to the patient’s torso to record electrical signals from the heart. According to Devkota, they don’t require specialized equipment or training to operate, but they still provide valuable insights into heart function.
“One of the criteria to diagnose heart failure is by measuring the ‘ejection fraction,’ or how much blood is pumped out of the heart with every beat, and the gold standard for doing that is with echocardiography, which uses sound waves to create images of the heart and the blood flowing through its valves,” she said.
“But echocardiography is expensive, time-consuming and often unavailable to patients in the very same rural Appalachian states that have the highest prevalence of heart failure across the nation. West Virginia, for example, ranks first in the U.S. for the prevalence of heart attack and coronary heart disease, but many West Virginians don’t have local access to high-tech echocardiograms. They do have access to inexpensive electrocardiograms, so we tested whether AI models could use electrocardiogram readings to predict a patient’s ejection fraction.”
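The ejection-fraction definition quoted above can be made concrete with a short worked example. The volumes below are hypothetical illustrative numbers, not data from the study.

```python
# Worked example of the ejection-fraction formula described above:
# EF (%) = (end-diastolic volume - end-systolic volume) / end-diastolic
# volume * 100, i.e. the fraction of blood pumped out with each beat.
# The volumes used here are illustrative placeholders.

def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and
    end-systolic volumes in millilitres."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml

ef = ejection_fraction(120.0, 50.0)  # ≈ 58.3%, within the typical normal range
```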
Devkota, Gyawali and their colleagues trained several AI models on patient records from 28 hospitals across West Virginia. The AI models used either “deep learning,” which relies on multilayered neural networks, or “non-deep learning,” which relies on simpler algorithms, to analyze the patient records and draw conclusions.
The researchers found the deep-learning models, particularly one called ResNet, did best at correctly predicting a patient’s ejection fraction based on data from 12-lead electrocardiograms, with the results suggesting that a larger dataset for training would yield even better results. They also found that providing the AI models with specific “leads,” or combinations of data from different electrode pairs, affected how accurate the models’ ejection fraction predictions were.
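The lead-combination comparison mentioned above can be sketched as a simple preprocessing step. This is not the study's code: the helper function, the fake recording, and the particular subsets compared are assumptions for illustration; only the standard 12-lead names are conventional.

```python
# Illustrative sketch of selecting subsets of ECG leads from a 12-lead
# recording before feeding them to a regression model, as the article
# describes evaluating different lead combinations. The recording data
# here is a placeholder; only the lead names follow the standard
# 12-lead convention.

LEAD_NAMES = ["I", "II", "III", "aVR", "aVL", "aVF",
              "V1", "V2", "V3", "V4", "V5", "V6"]

def select_leads(recording, leads):
    """recording: dict mapping lead name -> list of samples.
    Returns the chosen signals stacked in a fixed order, so the model
    input shape is (n_leads, n_samples)."""
    return [recording[name] for name in leads]

# A fake 12-lead recording with 4 samples per lead, for demonstration.
recording = {name: [0.0, 0.1, -0.05, 0.02] for name in LEAD_NAMES}

# Compare a single-lead input, a limb-lead subset, and all 12 leads.
single = select_leads(recording, ["II"])
limb = select_leads(recording, ["I", "II", "III", "aVR", "aVL", "aVF"])
full = select_leads(recording, LEAD_NAMES)
```

Sweeping over subsets like these and retraining the model on each is one straightforward way to measure how lead choice affects prediction accuracy.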
Gyawali said while AI models are not yet being used in clinical practice due to reliability concerns, training an AI to successfully estimate ejection fraction from electrocardiogram signals could soon give clinicians an edge in protecting patients’ cardiac health.
“Heart failure affects more than six million Americans today, and factors like our aging population mean the risk is growing rapidly—approximately 1 in 4 people alive today will experience heart failure during their lifetimes. The prevalence is even higher in rural Appalachia, so it’s critical the people here do not continue to be overlooked.”
Additional WVU contributors to the research included Rukesh Prajapati, graduate research assistant; Amr El-Wakeel, assistant professor; Donald Adjeroh, professor and chair for computer science; and Brijesh Patel, assistant professor in the WVU Health Sciences School of Medicine.
More information:
AI analysis for ejection fraction estimation from 12-lead ECG, Scientific Reports (2025). DOI: 10.1038/s41598-025-97113-0
Citation:
Researchers train AI to diagnose heart failure in rural patients using low-tech electrocardiograms (2025, August 31)
retrieved 31 August 2025
from https://medicalxpress.com/news/2025-08-ai-heart-failure-rural-patients.html