Robotics 2
♻ ☆ Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
Abhiram Maddukuri, Zhenyu Jiang, Lawrence Yunliang Chen, Soroush Nasiriany, Yuqi Xie, Yu Fang, Wenqi Huang, Zu Wang, Zhenjia Xu, Nikita Chernyadev, Scott Reed, Ken Goldberg, Ajay Mandlekar, Linxi Fan, Yuke Zhu
Large real-world robot datasets hold great potential to train generalist
robot models, but scaling real-world human data collection is time-consuming
and resource-intensive. Simulation holds great promise for supplementing
large-scale data, especially with recent advances in generative AI and
automated data generation tools that enable scalable creation of robot behavior
datasets. However, training a policy solely in simulation and transferring it
to the real world often demands substantial human effort to bridge the reality
gap. A compelling alternative is to co-train the policy on a mixture of
simulation and real-world datasets. Preliminary studies have recently shown
that this strategy substantially improves the performance of a policy over one
trained on a limited amount of real-world data alone. Nonetheless, the community
lacks a systematic understanding of sim-and-real co-training and what it takes
to reap the benefits of simulation data for real-robot learning. This work
presents a simple yet effective recipe for utilizing simulation data to solve
vision-based robotic manipulation tasks. We derive this recipe from
comprehensive experiments that validate the co-training strategy on various
simulation and real-world datasets. Using two domains--a robot arm and a
humanoid--across diverse tasks, we demonstrate that simulation data can enhance
real-world task performance by an average of 38%, even with notable differences
between the simulation and real-world data. Videos and additional results can
be found at https://co-training.github.io/
comment: Project website: https://co-training.github.io/
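The co-training strategy this abstract describes can be pictured, in its simplest form, as sampling each training batch from a fixed mixture of simulation and real-world data. The sketch below assumes exactly that; the function name, the 75/25 mixing ratio, and the stand-in datasets are illustrative choices, not details taken from the paper:

```python
import random

def make_cotraining_batch(sim_data, real_data, batch_size=8,
                          sim_ratio=0.75, rng=None):
    """Draw one mixed batch: roughly `sim_ratio` of samples come from
    the simulation dataset, the rest from the (scarcer) real dataset."""
    rng = rng or random.Random(0)
    n_sim = round(batch_size * sim_ratio)
    # Sample with replacement from each source, then shuffle so the
    # policy sees sim and real examples interleaved within the batch.
    batch = [rng.choice(sim_data) for _ in range(n_sim)]
    batch += [rng.choice(real_data) for _ in range(batch_size - n_sim)]
    rng.shuffle(batch)
    return batch

# Stand-ins for abundant simulated trajectories and scarce real demos.
sim = [("sim", i) for i in range(100)]
real = [("real", i) for i in range(10)]
batch = make_cotraining_batch(sim, real)
```

In practice the mixing ratio is a key hyperparameter: too little real data and the policy inherits the reality gap, too little simulation data and the benefit of scale is lost.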
♻ ☆ Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
The DeepSeek series has demonstrated outstanding performance in general scene
understanding, question-answering (QA), and text generation tasks, owing to its
efficient training paradigm and strong reasoning capabilities. In this study,
we investigate the dialogue capabilities of the DeepSeek model in robotic
surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and
Detailed Description. The Single Phrase QA tasks further include sub-tasks such
as surgical instrument recognition, action understanding, and spatial position
analysis. We conduct extensive evaluations using publicly available datasets,
including EndoVis18 and CholecT50, along with their corresponding dialogue
data. Our comprehensive evaluation results indicate that, when provided with
specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue
recognition tasks. However, DeepSeek-V3 exhibits significant limitations in
spatial position analysis and struggles to understand surgical actions
accurately. Additionally, our findings reveal that, under general prompts,
DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts
and fails to provide detailed insights into surgical scenarios. Based on our
observations, we argue that DeepSeek-V3 is not ready for vision-language
tasks in surgical contexts without fine-tuning on surgery-specific datasets.
comment: Technical Report