Robotics / Imitation Learning

Action Trajectory Labeling for a Robotics Lab Training Manipulation Policies

Fine-grained per-frame action segmentation across 220,000 multi-camera frames raised held-out task success from 41% to 73% on a 7-DOF arm. Annotation schema drew from RT-1 and Open X-Embodiment.

Robotics, Imitation Learning, Action Segmentation, Multi-Camera

Client

Robotics research lab

Volume

220,000 multi-camera frames

Duration

11 weeks

Team

18 multimodal annotators, 6 robotics-experienced reviewers

Languages

English (technical)

The challenge

The lab was training imitation learning policies for a 7-DOF arm on 24 manipulation tasks: stacking, pouring, sliding objects, opening drawers, peg insertion.

Their existing dataset was 5,000 episodes with binary success or failure labels. That was enough for ablation studies but not for training a Transformer policy along the lines of RT-1 (Brohan et al., 2022, Google).

The data lacked phase boundaries (approach, grasp, transport, release), object pose ground truth, and aligned gripper-state annotations. Reward shaping was not usable because tasks had no sub-goal labels.

Our approach

Stage 1: Video phase segmentation

Annotators reviewed synchronized footage from four cameras (overhead, two wrist views, third-person). They marked phase boundaries at 30 fps granularity.

approach: arm moving toward target object
grasp_attempt: gripper closing on object
grasp_success: object lifted, contact maintained
transport: arm carrying object toward goal
release: gripper opening, object placed
retreat: arm returning to home pose
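
As a minimal sketch of how boundary marks like these can be expanded into per-frame labels, assume each annotation is a (start frame, phase) pair and that a phase runs until the next boundary. The names `Phase`, `expand_boundaries`, and `frame_to_seconds` are illustrative and not taken from the project tooling.

```python
from enum import Enum


class Phase(Enum):
    """The six phases listed above."""
    APPROACH = "approach"
    GRASP_ATTEMPT = "grasp_attempt"
    GRASP_SUCCESS = "grasp_success"
    TRANSPORT = "transport"
    RELEASE = "release"
    RETREAT = "retreat"


FPS = 30  # boundary granularity stated in the case study


def expand_boundaries(boundaries, num_frames):
    """Turn (start_frame, Phase) pairs into one label per frame.

    Assumption: each phase runs from its start frame up to (not including)
    the next boundary. Frames before the first boundary stay None.
    """
    labels = [None] * num_frames
    ordered = sorted(boundaries, key=lambda b: b[0])
    for i, (start, phase) in enumerate(ordered):
        end = ordered[i + 1][0] if i + 1 < len(ordered) else num_frames
        for frame in range(max(start, 0), min(end, num_frames)):
            labels[frame] = phase
    return labels


def frame_to_seconds(frame_index, fps=FPS):
    """Convert a frame index to a timestamp in seconds."""
    return frame_index / fps
```

For example, `expand_boundaries([(0, Phase.APPROACH), (45, Phase.GRASP_ATTEMPT)], 90)` labels frames 0 to 44 as approach and frames 45 to 89 as grasp_attempt.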

Stage 2: Object and gripper annotation

The labeling schema drew on BC-Z (Jang et al., 2021, Google) and the multi-embodiment standard introduced in Open X-Embodiment (Padalkar et al., 2023).

3D bounding box on target object every 10 frames
Gripper open/close state aligned with timestamped sensor logs
Slip detection labels: gripper closed but object not actually grasped
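
The gripper alignment and slip labels above can be sketched roughly as follows. This assumes the sensor log is a time-sorted list of (timestamp in seconds, closed flag) pairs and that frame timestamps follow from the 30 fps rate; the log format, the `max_gap_s` threshold, and all function names are hypothetical.

```python
from bisect import bisect_left


def nearest_sample(log_times, t):
    """Index of the log timestamp closest to t (log_times must be sorted)."""
    i = bisect_left(log_times, t)
    if i == 0:
        return 0
    if i == len(log_times):
        return len(log_times) - 1
    # Pick whichever neighbour is closer in time.
    return i if log_times[i] - t < t - log_times[i - 1] else i - 1


def align_gripper_to_frames(log, num_frames, fps=30, max_gap_s=0.05):
    """Assign each video frame the nearest gripper reading.

    Returns one entry per frame: True (closed), False (open), or None when
    no reading lies within max_gap_s of the frame (hypothetical threshold).
    """
    if not log:
        return [None] * num_frames
    times = [t for t, _ in log]
    aligned = []
    for frame in range(num_frames):
        t = frame / fps
        j = nearest_sample(times, t)
        aligned.append(log[j][1] if abs(times[j] - t) <= max_gap_s else None)
    return aligned


def slip_frames(gripper_closed, object_held):
    """Frames where the gripper is closed but the object is not held."""
    return [
        i
        for i, (closed, held) in enumerate(zip(gripper_closed, object_held))
        if closed is True and held is False
    ]
```

Here `object_held` stands in for a per-frame annotator judgment; how that judgment was recorded in the actual project is not specified.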

Stage 3: Failure mode labeling

A subtask hierarchy was layered on top: each high-level task was decomposed into primitive actions. Failure episodes were classified by mode so the client could weight training samples accordingly.

slip, knock-over, mis-grasp, drop, collision, timeout
Recovery vs. terminal failure distinction
Object-state-at-failure for downstream replay buffers
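
One way the failure labels could feed sample weighting is sketched below. The record layout, field names, and weight values are assumptions for illustration; the case study does not state how the client weighted samples.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    """The six failure modes listed above."""
    SLIP = "slip"
    KNOCK_OVER = "knock-over"
    MIS_GRASP = "mis-grasp"
    DROP = "drop"
    COLLISION = "collision"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class EpisodeLabel:
    """Hypothetical per-episode record; field names are illustrative."""
    episode_id: str
    failure_mode: Optional[FailureMode] = None  # None means the episode succeeded
    recovered: bool = False  # True: recovery, False: terminal failure


def sample_weight(label, mode_weights, recovered_scale=1.0, default=1.0):
    """Look up a training weight for an episode.

    Successful episodes and unlisted failure modes get the default weight.
    Recovered failures are scaled by recovered_scale (an assumed knob).
    """
    if label.failure_mode is None:
        return default
    weight = mode_weights.get(label.failure_mode, default)
    return weight * recovered_scale if label.recovered else weight


# Placeholder weights, not values from the project.
EXAMPLE_WEIGHTS = {FailureMode.SLIP: 0.5, FailureMode.TIMEOUT: 0.25}
```

Keeping the weights in a plain mapping means they can be tuned per experiment without touching the labels themselves.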

Domain expertise

Six reviewers had robotics research backgrounds, including former lab members from manipulation research groups. They authored the edge-case guidelines for ambiguous events such as partial slips during contact-rich tasks.

They also flagged 1,200 episodes where the robot's proprioceptive logs did not match the vision frames. The cause turned out to be calibration drift on the arm's force sensor, and finding it early saved the client several weeks of model debugging.

Quality assurance

Three-rater consensus on phase boundaries, with a 3-frame tolerance window (100 ms at 30 fps). 12.4% of frames required adjudication after the first pass. The domain lead audited one task class per week in full.
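
A consensus check of this kind can be sketched as below. The rule shown is an assumption: three raters agree when their marks for the same transition span no more than the tolerance, and the median frame is kept. The exact consensus rule used on the project is not stated, and the function names are illustrative.

```python
from statistics import median


def consensus_boundary(frames, tolerance=3):
    """Resolve one phase boundary from three raters' frame indices.

    Returns (frame, needs_adjudication). If the spread between the earliest
    and latest mark is within the tolerance, the median mark is accepted.
    Otherwise the boundary is flagged for adjudication.
    """
    if len(frames) != 3:
        raise ValueError("expected exactly three rater marks")
    if max(frames) - min(frames) <= tolerance:
        return int(median(frames)), False
    return None, True


def resolve_episode(marks_by_transition, tolerance=3):
    """Split an episode's transitions into accepted and flagged sets.

    marks_by_transition maps a transition name to the three raters' frames.
    """
    accepted, flagged = {}, []
    for name, frames in marks_by_transition.items():
        frame, needs_review = consensus_boundary(frames, tolerance)
        if needs_review:
            flagged.append(name)
        else:
            accepted[name] = frame
    return accepted, flagged
```

For instance, marks of 120, 121, and 123 resolve to frame 121, while 120, 121, and 126 go to adjudication.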

Results

Frames annotated: 218,400

Phase boundary F1: 0.91

Gripper-state alignment: 99.2%

Policy success (held-out): 73% (from 41%)
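
For readers unfamiliar with boundary F1, the sketch below shows one common way to compute it: a predicted boundary counts as correct if it falls within a tolerance of an unmatched reference boundary. The greedy one-to-one matching and the reuse of the 3-frame tolerance are assumptions, not a description of how the 0.91 figure was produced.

```python
def boundary_f1(predicted, reference, tolerance=3):
    """F1 over phase boundaries with a frame tolerance.

    Both inputs are lists of frame indices. Each reference boundary can be
    matched by at most one predicted boundary (greedy, in frame order).
    """
    pred = sorted(predicted)
    ref = sorted(reference)
    i = j = matches = 0
    while i < len(pred) and j < len(ref):
        diff = pred[i] - ref[j]
        if abs(diff) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # predicted boundary is too early to match anything left
        else:
            j += 1  # reference boundary was missed
    precision = matches / len(pred) if pred else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With predictions at frames 10, 52, and 90 against references at 11, 50, and 120, two of three boundaries match, so precision and recall are both 2/3.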

What made it work

1. Single-camera annotation missed slip events that were only visible from the wrist viewpoint. Multi-camera annotation tooling was non-negotiable for contact-rich tasks.
2. Robotics-experienced reviewers turned out to be the difference between data that trained well and data that looked right but did not help the policy. Schema design needs people who understand what the model can actually learn.
3. Iterative review loops with the client's ML team during the pilot week avoided shipping 50,000 labels under the wrong schema. Pilots paid for themselves many times over.
4. Annotating failures and failure modes was more valuable than the team initially expected. The client used the failure-classified subset for offline reinforcement learning experiments.

References

Published research that informed the labeling schema and workflow.

Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. · RSS 2023
Padalkar, A. et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. · Open X-Embodiment Collaboration
Jang, E. et al. (2021). BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. · CoRL 2021
Mandlekar, A. et al. (2021). What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. · CoRL 2021
Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. · Google DeepMind
