Robotics / Imitation Learning

Action Trajectory Labeling for a Robotics Lab Training Manipulation Policies

Fine-grained per-frame action segmentation across 220,000 multi-camera frames raised held-out task success from 41% to 73% on a 7-DOF arm. The annotation schema drew on RT-1 and Open X-Embodiment.

Robotics · Imitation Learning · Action Segmentation · Multi-Camera

Client: Robotics research lab

Volume: 220,000 multi-camera frames

Duration: 11 weeks

Team: 18 multimodal annotators, 6 robotics-experienced reviewers

Languages: English (technical)

The challenge

The lab was training imitation learning policies for a 7-DOF arm on 24 manipulation tasks, including stacking, pouring, sliding objects, opening drawers, and peg insertion.

Their existing dataset comprised 5,000 episodes labeled only with binary success or failure. That was enough for ablation studies but not for training a Transformer policy along the lines of RT-1 (Brohan et al., 2022, Google).

The data lacked phase boundaries (approach, grasp, transport, release), object pose ground truth, and aligned gripper-state annotations. Reward shaping was not usable because tasks had no sub-goal labels.

Our approach

Stage 1: Video phase segmentation

Annotators reviewed synchronized footage from four cameras (overhead, two wrist views, third-person). They marked phase boundaries frame by frame on 30 fps video, so each boundary resolves to roughly 33 ms.

  • approach: arm moving toward target object
  • grasp_attempt: gripper closing on object
  • grasp_success: object lifted, contact maintained
  • transport: arm carrying object toward goal
  • release: gripper opening, object placed
  • retreat: arm returning to home pose
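The phase list above maps naturally onto a small data structure. The following is a minimal sketch, not the tooling used on the project: the `PhaseSegment` type, the half-open frame ranges, and the `expand_to_frames` helper are illustrative assumptions. It shows how boundary marks could be expanded into one label per frame while rejecting gaps and overlaps.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(str, Enum):
    # Names taken from the phase list in the case study.
    APPROACH = "approach"
    GRASP_ATTEMPT = "grasp_attempt"
    GRASP_SUCCESS = "grasp_success"
    TRANSPORT = "transport"
    RELEASE = "release"
    RETREAT = "retreat"


@dataclass(frozen=True)
class PhaseSegment:
    # Hypothetical record type: frame range is [start_frame, end_frame).
    phase: Phase
    start_frame: int
    end_frame: int


def expand_to_frames(segments, num_frames):
    """Return one Phase per frame; raise if segments overlap or leave gaps."""
    labels = [None] * num_frames
    for seg in segments:
        if not 0 <= seg.start_frame < seg.end_frame <= num_frames:
            raise ValueError(f"segment out of range: {seg}")
        for f in range(seg.start_frame, seg.end_frame):
            if labels[f] is not None:
                raise ValueError(f"frame {f} labeled twice")
            labels[f] = seg.phase
    missing = [f for f, p in enumerate(labels) if p is None]
    if missing:
        raise ValueError(f"unlabeled frames, first few: {missing[:5]}")
    return labels
```

Half-open ranges keep adjacent segments from sharing a frame, so each boundary is a single integer: the first frame of the next phase.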

Stage 2: Object and gripper annotation

The labeling schema drew on BC-Z (Jang et al., 2021, Google) and the multi-embodiment data standard introduced in Open X-Embodiment (Padalkar et al., 2023).

  • 3D bounding box on target object every 10 frames
  • Gripper open/close state aligned with timestamped sensor logs
  • Slip detection labels: gripper closed but object not actually grasped
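One way to align gripper state with video frames is a nearest-timestamp lookup against the sensor log. The sketch below assumes sorted log timestamps in seconds measured from the first frame, a constant 30 fps, and a hypothetical `max_gap_s` threshold beyond which a frame is left unlabeled. None of these details are stated in the case study. The `slip_frames` helper applies the slip definition above: gripper closed, object not held.

```python
from bisect import bisect_left

FPS = 30.0  # frame rate from the case study


def nearest_sample(timestamps, t):
    """Index of the sample closest in time to t (timestamps sorted ascending)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbor is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def gripper_state_per_frame(num_frames, log_times, log_closed, max_gap_s=0.05):
    """Map each frame to a closed/open flag, or None if no sample is near enough."""
    states = []
    for f in range(num_frames):
        t = f / FPS
        j = nearest_sample(log_times, t)
        near = abs(log_times[j] - t) <= max_gap_s
        states.append(log_closed[j] if near else None)
    return states


def slip_frames(closed, object_held):
    """Frames where the gripper is closed but the object is not held."""
    return [f for f, (c, h) in enumerate(zip(closed, object_held)) if c and not h]
```

Leaving distant frames as `None` rather than guessing makes sensor dropouts visible to reviewers instead of hiding them behind a stale value.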

Stage 3: Failure mode labeling

Subtask hierarchy was layered on top: each high-level task decomposed into primitive actions. Failure episodes were classified by mode so the client could weight training samples accordingly.

  • slip, knock-over, mis-grasp, drop, collision, timeout
  • Recovery vs. terminal failure distinction
  • Object-state-at-failure for downstream replay buffers
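The failure taxonomy above could be encoded per episode and turned into training weights roughly as follows. This is a sketch under stated assumptions: the `EpisodeLabel` record and the default weight values are hypothetical placeholders, not the client's actual weighting scheme.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(str, Enum):
    # Modes taken from the list in the case study.
    SLIP = "slip"
    KNOCK_OVER = "knock_over"
    MIS_GRASP = "mis_grasp"
    DROP = "drop"
    COLLISION = "collision"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class EpisodeLabel:
    # Hypothetical record; failure_mode is None for a successful episode.
    episode_id: str
    failure_mode: Optional[FailureMode] = None
    recovered: bool = False  # True for a recovered (non-terminal) failure


def sample_weight(label, success_w=1.0, recovered_w=0.5, terminal_w=0.0):
    """Return a training weight; the default values are illustrative only."""
    if label.failure_mode is None:
        return success_w
    return recovered_w if label.recovered else terminal_w
```

Keeping the recovery flag separate from the failure mode lets the same labels serve both imitation learning, which may drop terminal failures, and offline RL, which may keep them.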

Domain expertise

Six reviewers had robotics research backgrounds, including former lab members from manipulation research groups. They authored the edge-case guidelines for ambiguous events such as partial slips during contact-rich tasks.

They also flagged 1,200 episodes where the robot's proprioceptive logs did not match the vision frames. This turned out to be a calibration drift on the arm's force sensor, and finding it early saved the client several weeks of model debugging.

Quality assurance

Phase boundaries required three-rater consensus within a 3-frame tolerance window (100 ms at 30 fps). After the first pass, 12.4% of frames required adjudication. The domain lead audited one task class per week in full.
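A consensus check of this kind can be sketched in a few lines. The decision rule here is an assumption for illustration: the three marks agree when their total spread is within the tolerance, the median is taken as the consensus frame, and anything wider is sent to adjudication. The case study does not specify the exact rule.

```python
from statistics import median

TOLERANCE_FRAMES = 3  # tolerance window from the case study


def consensus_boundary(marks, tolerance=TOLERANCE_FRAMES):
    """Return (frame, needs_adjudication) for one boundary marked by three raters."""
    if len(marks) != 3:
        raise ValueError("expected exactly three rater marks")
    spread = max(marks) - min(marks)
    if spread <= tolerance:
        # Raters agree within tolerance: use the middle mark.
        return int(median(marks)), False
    # Disagreement is too wide: no consensus frame, flag for adjudication.
    return None, True
```

For example, marks of 120, 121, and 122 resolve to frame 121, while marks of 120, 121, and 130 are flagged for adjudication.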

Results

Frames annotated: 218,400

Phase boundary F1: 0.91

Gripper-state alignment: 99.2%

Policy success (held-out): 73% (from 41%)
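The case study does not say how the phase boundary F1 was computed. One common approach, shown here purely as an assumed sketch, counts a predicted boundary as correct when it falls within a tolerance of an unmatched reference boundary, then combines precision and recall. The function name and the 3-frame default are illustrative.

```python
def boundary_f1(predicted, reference, tolerance=3):
    """Tolerance-matched F1 between two lists of boundary frame indices."""
    pred = sorted(predicted)
    ref = sorted(reference)
    i = j = matched = 0
    # Two-pointer sweep: each boundary is matched at most once.
    while i < len(pred) and j < len(ref):
        if abs(pred[i] - ref[j]) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif pred[i] < ref[j]:
            i += 1  # predicted boundary has no reference partner
        else:
            j += 1  # reference boundary was missed
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With predicted boundaries at frames 10, 50, and 90 and reference boundaries at 11, 52, and 200, two of three match on each side, giving precision, recall, and F1 of 2/3.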

What made it work

  1. Single-camera annotation missed slip events that were only visible from the wrist viewpoint. Multi-camera annotation tooling was non-negotiable for contact-rich tasks.
  2. Robotics-experienced reviewers turned out to be the difference between data that trained well and data that looked right but did not help the policy. Schema design needs people who understand what the model can actually learn.
  3. Iterative review loops with the client's ML team during the pilot week avoided shipping 50,000 labels under the wrong schema. Pilots paid for themselves many times over.
  4. Annotating failures and failure modes was more valuable than the team initially expected. The client used the failure-classified subset for offline reinforcement learning experiments.

References

Published research that informed the labeling schema and workflow.

  1. Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. · RSS 2023
  2. Padalkar, A. et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. · Open X-Embodiment Collaboration
  3. Jang, E. et al. (2021). BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. · CoRL 2021
  4. Mandlekar, A. et al. (2021). What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. · CoRL 2021
  5. Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. · Google DeepMind
