Robotics / Imitation Learning
Action Trajectory Labeling for a Robotics Lab Training Manipulation Policies
Fine-grained per-frame action segmentation across 220,000 multi-camera frames raised held-out task success from 41% to 73% on a 7-DOF arm. Annotation schema drew from RT-1 and Open X-Embodiment.
Client: Robotics research lab
Volume: 220,000 multi-camera frames
Duration: 11 weeks
Team: 18 multimodal annotators, 6 robotics-experienced reviewers
Languages: English (technical)
The challenge
The lab was training imitation learning policies for a 7-DOF arm on 24 manipulation tasks: stacking, pouring, sliding objects, opening drawers, peg insertion.
Their existing dataset was 5,000 episodes with binary success or failure labels. That was enough for ablation studies but not for training a Transformer policy along the lines of RT-1 (Brohan et al., 2022, Google).
The data lacked phase boundaries (approach, grasp, transport, release), object pose ground truth, and aligned gripper-state annotations. Reward shaping was not usable because tasks had no sub-goal labels.
Our approach
Stage 1: Video phase segmentation
Annotators reviewed synchronized footage from four cameras (overhead, two wrist views, third-person) and marked phase boundaries at single-frame granularity (30 fps), using six phase labels:
- approach: arm moving toward target object
- grasp_attempt: gripper closing on object
- grasp_success: object lifted, contact maintained
- transport: arm carrying object toward goal
- release: gripper opening, object placed
- retreat: arm returning to home pose
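As a rough illustration of how the phase labels above could become per-frame training targets, the sketch below expands a list of boundary marks into one label per frame. The `Phase` enum mirrors the six labels listed; the function name `boundaries_to_frame_labels` and the `(start_frame, phase)` tuple layout are assumptions made for this example, not the project's actual export format.

```python
from enum import Enum
from typing import List, Tuple


class Phase(str, Enum):
    # The six phase labels from the list above.
    APPROACH = "approach"
    GRASP_ATTEMPT = "grasp_attempt"
    GRASP_SUCCESS = "grasp_success"
    TRANSPORT = "transport"
    RELEASE = "release"
    RETREAT = "retreat"


def boundaries_to_frame_labels(
    boundaries: List[Tuple[int, Phase]], num_frames: int
) -> List[Phase]:
    """Expand (start_frame, phase) marks into one label per frame.

    Hypothetical layout: each phase runs from its start frame up to
    (but not including) the next start frame; the last phase runs to
    the end of the episode.
    """
    if not boundaries or boundaries[0][0] != 0:
        raise ValueError("the first boundary must start at frame 0")
    starts = [start for start, _ in boundaries]
    if starts != sorted(set(starts)):
        raise ValueError("start frames must be strictly increasing")
    if starts[-1] >= num_frames:
        raise ValueError("a boundary starts past the end of the episode")

    labels: List[Phase] = []
    for i, (start, phase) in enumerate(boundaries):
        end = boundaries[i + 1][0] if i + 1 < len(boundaries) else num_frames
        labels.extend([phase] * (end - start))
    return labels


episode_labels = boundaries_to_frame_labels(
    [(0, Phase.APPROACH), (3, Phase.GRASP_ATTEMPT), (5, Phase.GRASP_SUCCESS)],
    num_frames=8,
)
```

Storing boundaries rather than per-frame labels keeps annotator output compact, and the expansion can be redone if the frame rate or phase set changes.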
Stage 2: Object and gripper annotation
The labeling schema was based on BC-Z (Jang et al., 2021, Google) and the multi-embodiment standard introduced in Open X-Embodiment (Padalkar et al., 2023). Each episode received:
- 3D bounding box on target object every 10 frames
- Gripper open/close state aligned with timestamped sensor logs
- Slip detection labels: gripper closed but object not actually grasped
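One way to picture the gripper-state alignment and slip labels above is the sketch below. It assumes the sensor log is a sorted list of timestamps with a closed/open flag per sample, and that each video frame takes the most recent log sample at or before its timestamp. The function names, the argument layout, and the hold-last-value rule are all illustrative assumptions; the source does not describe the alignment method used.

```python
from bisect import bisect_right
from typing import List, Sequence


def align_gripper_state(
    frame_times: Sequence[float],
    log_times: Sequence[float],
    log_closed: Sequence[bool],
) -> List[bool]:
    """For each frame, return the latest logged gripper state at or before it.

    log_times must be sorted in ascending order.
    """
    aligned: List[bool] = []
    for t in frame_times:
        idx = bisect_right(log_times, t) - 1
        if idx < 0:
            raise ValueError(f"frame at t={t} precedes the first sensor sample")
        aligned.append(log_closed[idx])
    return aligned


def slip_flags(
    gripper_closed: Sequence[bool], object_grasped: Sequence[bool]
) -> List[bool]:
    """Flag frames where the gripper is closed but the object is not held."""
    return [closed and not held for closed, held in zip(gripper_closed, object_grasped)]


closed_per_frame = align_gripper_state(
    frame_times=[0.0, 0.033, 0.066, 0.1],
    log_times=[0.0, 0.05],
    log_closed=[False, True],
)
slips = slip_flags(closed_per_frame, object_grasped=[False, False, True, False])
```

Here `object_grasped` stands in for the annotator's visual judgment of whether the object is actually held, which is what separates a slip from a normal closed-gripper frame.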
Stage 3: Failure mode labeling
A subtask hierarchy was layered on top: each high-level task was decomposed into primitive actions. Failure episodes were classified by mode so the client could weight training samples accordingly.
- slip, knock-over, mis-grasp, drop, collision, timeout
- Recovery vs. terminal failure distinction
- Object-state-at-failure for downstream replay buffers
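The source does not say how the client weighted samples by failure mode. As one hypothetical scheme, the sketch below assigns inverse-frequency weights so that rare failure modes are not drowned out by common ones. The function name and the weighting formula are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Sequence

# The six failure modes listed above.
FAILURE_MODES = ("slip", "knock-over", "mis-grasp", "drop", "collision", "timeout")


def inverse_frequency_weights(modes: Sequence[str]) -> Dict[str, float]:
    """Return a per-mode weight of n / (k * count).

    n is the number of failure episodes and k is the number of distinct
    modes present, so the weights average to 1.0 across all episodes.
    """
    counts = Counter(modes)
    unknown = set(counts) - set(FAILURE_MODES)
    if unknown:
        raise ValueError(f"unknown failure modes: {sorted(unknown)}")
    n, k = len(modes), len(counts)
    return {mode: n / (k * count) for mode, count in counts.items()}


weights = inverse_frequency_weights(["slip", "slip", "slip", "drop"])
```

In this toy input, the single drop episode gets three times the weight of each slip episode, while the mean weight over all four episodes stays at 1.0.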
Domain expertise
Six reviewers had robotics research backgrounds, including former members of manipulation research groups. They authored the edge-case guidelines for ambiguous events such as partial slips during contact-rich tasks.
They also flagged 1,200 episodes where the robot's proprioceptive logs did not match the vision frames. The cause turned out to be calibration drift in the arm's force sensor, and catching it early saved the client several weeks of model debugging.
Quality assurance
Phase boundaries were set by three-rater consensus with a 3-frame tolerance window. After the first pass, 12.4% of frames required adjudication. The domain lead audited one task class per week in full.
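A minimal sketch of the consensus rule follows, under one possible reading of the tolerance window: the three raters agree when their marks for the same boundary all fall within 3 frames of each other, in which case the median frame is kept. Otherwise the boundary goes to adjudication. The function name and this interpretation are assumptions; the actual rule may have differed.

```python
from statistics import median
from typing import Optional, Sequence, Tuple


def consensus_boundary(
    marks: Sequence[int], tolerance: int = 3
) -> Tuple[Optional[int], bool]:
    """Return (consensus_frame, needs_adjudication) for one phase boundary.

    marks holds the frame index chosen by each of the three raters.
    """
    if len(marks) != 3:
        raise ValueError("expected exactly three rater marks")
    spread = max(marks) - min(marks)
    if spread <= tolerance:
        # With three marks the median is the middle rater's frame.
        return int(median(marks)), False
    return None, True


agreed = consensus_boundary([120, 121, 123])
disputed = consensus_boundary([120, 121, 130])
```

Taking the median rather than the mean keeps the consensus on a frame that at least one rater actually selected.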
Results
Frames annotated: 218,400
Phase boundary F1: 0.91
Gripper-state alignment: 99.2%
Policy success (held-out): 73% (from 41%)
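The source does not state how the phase boundary F1 was computed. One common convention, sketched below as an assumption, counts a marked boundary as correct when it falls within a fixed number of frames of a reference boundary, matching each reference boundary at most once. The function name, the default tolerance of 3 frames, and the two-pointer greedy matching are illustrative choices.

```python
from typing import Sequence


def boundary_f1(
    predicted: Sequence[int], reference: Sequence[int], tolerance: int = 3
) -> float:
    """F1 over boundary frames with one-to-one matching within a tolerance."""
    pred, ref = sorted(predicted), sorted(reference)
    i = j = true_positives = 0
    while i < len(pred) and j < len(ref):
        if abs(pred[i] - ref[j]) <= tolerance:
            true_positives += 1
            i += 1
            j += 1
        elif pred[i] < ref[j]:
            i += 1  # unmatched predicted boundary (false positive)
        else:
            j += 1  # unmatched reference boundary (false negative)
    total = len(pred) + len(ref)
    # Convention assumed here: two empty boundary sets agree perfectly.
    return 2 * true_positives / total if total else 1.0


score = boundary_f1(predicted=[10, 50, 90], reference=[12, 49, 200])
```

In this toy input two of the three boundaries on each side match, so precision and recall are both 2/3.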
What made it work
1. Single-camera annotation missed slip events that were only visible from the wrist viewpoint. Multi-camera annotation tooling was non-negotiable for contact-rich tasks.
2. Robotics-experienced reviewers turned out to be the difference between data that trained well and data that looked right but did not help the policy. Schema design needs people who understand what the model can actually learn.
3. Iterative review loops with the client's ML team during the pilot week avoided shipping 50,000 labels under the wrong schema. Pilots paid for themselves many times over.
4. Annotating failures and failure modes was more valuable than the team initially expected. The client used the failure-classified subset for offline reinforcement learning experiments.
References
Published research that informed the labeling schema and workflow.
- Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. · RSS 2023
- Padalkar, A. et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. · Open X-Embodiment Collaboration
- Jang, E. et al. (2021). BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. · CoRL 2021
- Mandlekar, A. et al. (2021). What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. · CoRL 2021
- Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. · Google DeepMind