Robotics / Vision-Language Foundation Models

Scaling Multi-View Robotic Video Annotation From Manual Process to 1,000-Hour Ramp

How a managed annotation pipeline replaced engineer-led labeling for a robotic foundation model team, hitting October readiness for a November training ramp on 1,000 hours of multi-view video with action, object, and spatial language labels.

Robotics · Vision-Language-Action · Video Annotation · Multi-View

Client: Robotic foundation model startup
Volume: 1,000 hours of multi-view robotic video
Duration: 8 weeks (pilot + ramp, October readiness for November training)
Team: 30 video annotators, 8 reviewers with robotics backgrounds, 2 project leads
Languages: English (technical, spatial language)

The challenge

The client was preparing to scale their training data pipeline ahead of a major model ramp. They had 1,000 hours of multi-view robotic video that needed dense action, object, and spatial-positioning labels before the November training cycle could begin. The deadline was hard. The pipeline was not.

Annotation was still manual and engineer-led. The client's robotics engineers were spending evenings and weekends labeling video, which worked for small batches but would not survive a 1,000-hour ramp. Every hour of engineer annotation time was an hour not spent on policy training, dataset curation, or eval infrastructure. The opportunity cost was already showing in R&D velocity.

Quality was the second weak point. Labeling by a rotating set of engineers produced inconsistent action segmentation, drift in the object class taxonomy, and free-form spatial descriptions that the downstream vision-language-action (VLA) model could not learn from cleanly. Errors in any of the three label streams degraded model reliability in ways that were hard to attribute later.

The combined problem was clear: scale annotation off the engineering team, lock the quality bar before ramp, and hit October readiness so November training could start on schedule.

Our approach

Schema design for vision-language-action data

The label schema was designed against the client's downstream model architecture, drawing on the input format conventions in RT-2 (Brohan et al., 2023, Google DeepMind) and the Open X-Embodiment dataset standard (Padalkar et al., 2023). Action labels, object tracks, and spatial language were specified so that they could feed the VLA training pipeline without reformatting.

  • Action segments: start frame, end frame, action verb, target object, sub-action chain for compound motions
  • Object tracks: per-instance track IDs, bounding boxes at keyframes every 10 frames with interpolation, attribute tags
  • Spatial positioning language: relative-position phrases ("to the left of", "behind", "approaching", "in contact with") tied to the action grammar
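The three label streams above can be sketched as plain Python records. This is a minimal illustration, not the schema delivered to the client: every class name, field name, and the pixel box convention are assumptions, and the relation vocabulary contains only the four example phrases listed above. The `box_at` method shows one common way to fill the frames between 10-frame keyframes, linear interpolation; the source does not say which interpolation method was used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]

KEYFRAME_STRIDE = 10  # keyframe spacing stated in the schema

# Only the example phrases from the text; the real vocabulary is unknown.
EXAMPLE_RELATIONS = {"to the left of", "behind", "approaching", "in contact with"}


@dataclass
class ActionSegment:
    start_frame: int
    end_frame: int
    verb: str
    target_track_id: int
    sub_actions: List[str] = field(default_factory=list)  # chain for compound motions


@dataclass
class ObjectTrack:
    track_id: int
    keyframes: Dict[int, Box]  # frame index -> annotated box
    attributes: List[str] = field(default_factory=list)

    def box_at(self, frame: int) -> Box:
        """Return the annotated box, or a linear interpolation between keyframes."""
        if frame in self.keyframes:
            return self.keyframes[frame]
        prev = max((f for f in self.keyframes if f < frame), default=None)
        nxt = min((f for f in self.keyframes if f > frame), default=None)
        if prev is None or nxt is None:
            raise ValueError(f"frame {frame} lies outside the annotated keyframes")
        t = (frame - prev) / (nxt - prev)
        a, b = self.keyframes[prev], self.keyframes[nxt]
        return tuple(p + t * (q - p) for p, q in zip(a, b))


@dataclass
class SpatialRelation:
    frame: int
    subject_track_id: int
    relation: str
    object_track_id: int

    def __post_init__(self) -> None:
        # A closed vocabulary is what keeps spatial language out of free-form text.
        if self.relation not in EXAMPLE_RELATIONS:
            raise ValueError(f"unknown relation: {self.relation!r}")


track = ObjectTrack(track_id=1, keyframes={0: (0, 0, 10, 10), 10: (10, 0, 20, 10)})
midpoint_box = track.box_at(5)
```

Restricting `relation` to a fixed set is the design choice that matters here: it is what turns spatial language from free-form description into a label stream a model can learn from.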

Multi-view synchronized annotation

Multi-view robotic data introduces a coordination problem that single-camera annotation tooling cannot handle. The same world event has to be labeled consistently across overhead, wrist, and third-person views. Annotators reviewed all camera streams in a synchronized viewer with shared timeline scrubbing and cross-view event linking.

Spatial language labels were authored against the third-person canonical view but verified against wrist-view contact and overhead-view layout. Disagreements between views surfaced calibration issues and labeling errors that single-view annotation would have shipped.
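The shared-timeline idea can be illustrated with a short sketch. It assumes each camera has a constant frame rate and a known start offset on a common clock. The viewer's actual synchronization mechanism is not described in the source, so the names `CameraStream`, `linked_frames`, and `views_disagreeing` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class CameraStream:
    name: str
    fps: float
    start_offset_s: float  # shared-clock time at which this camera's frame 0 was captured

    def frame_at(self, t_shared_s: float) -> int:
        """Nearest frame index for a time on the shared clock.

        A negative result means the camera had not started recording yet.
        """
        return round((t_shared_s - self.start_offset_s) * self.fps)


def linked_frames(streams: Sequence[CameraStream], t_shared_s: float) -> Dict[str, int]:
    """Map one shared-timeline position to a frame index in every view."""
    return {s.name: s.frame_at(t_shared_s) for s in streams}


def views_disagreeing(labels_by_view: Dict[str, str],
                      canonical: str = "third_person") -> List[str]:
    """List the views whose label for an event differs from the canonical view."""
    reference = labels_by_view[canonical]
    return sorted(v for v, label in labels_by_view.items() if label != reference)


streams = [
    CameraStream("overhead", fps=30.0, start_offset_s=0.0),
    CameraStream("wrist", fps=60.0, start_offset_s=0.5),
    CameraStream("third_person", fps=30.0, start_offset_s=0.1),
]
frames = linked_frames(streams, t_shared_s=2.0)
flagged = views_disagreeing(
    {"third_person": "in contact with", "wrist": "approaching", "overhead": "in contact with"}
)
```

Here the wrist view is flagged because its label contradicts the canonical third-person label for the same event, which is the kind of disagreement that would go to a reviewer.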

Pilot week and calibration

The first week ran a representative pilot on 40 hours of video sampled across all task classes the November model needed to learn. The sample deliberately overweighted edge cases: occluded objects, low-light scenes, cluttered backgrounds, and contact-rich phases. Easy examples were not the bottleneck. Hard ones were.

Guidelines went through three revisions during week one. By Friday the schema was stable and inter-annotator agreement on action segments had settled at Cohen's kappa 0.86. Object tracking F1 against reviewer-authored ground truth came in at 0.91. Spatial language agreement, the most subjective track, was held to three-way consensus throughout the project.
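For readers unfamiliar with the two metrics, the sketch below computes Cohen's kappa for two annotators and F1 from match counts. It assumes kappa is taken over per-frame action labels and that F1 is computed from true-positive, false-positive, and false-negative track matches. The source does not state the unit of comparison or the matching rule, so treat both as assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


annotator_1 = ["reach", "reach", "grasp", "grasp"]
annotator_2 = ["reach", "grasp", "grasp", "grasp"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.75 observed vs 0.5 by chance
f1 = f1_score(tp=91, fp=9, fn=9)
```

In the toy example the annotators agree on 3 of 4 frames, but half of that agreement is expected by chance, so kappa is 0.5. That gap is why kappa is reported instead of raw agreement.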

Ramp to production throughput

Production batches scaled from 40 hours/week in week two to a steady 140 hours/week from week four onward. The team finished the ramp two days ahead of the October readiness target, with a 50-hour buffer for late client-driven schema additions.

  • 30 video annotators on action and object tracks
  • 8 reviewers with robotics or video understanding backgrounds owning QA and adjudication
  • 2 project leads coordinating delivery, schema versioning, and client check-ins
  • Daily QA sample of 5% of delivered video, with a full audit of any class showing drift in inter-annotator agreement (IAA)
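The QA rule in the last bullet can be sketched as follows. The 5% rate comes from the text; everything else is an assumption for illustration. Sampling is done by clip count, which matches 5% of video duration only if clips are of similar length, and the drift tolerance of 0.05 below the pilot baseline is a hypothetical threshold, not one stated by the team.

```python
import random
from typing import Dict, List, Sequence

QA_SAMPLE_RATE = 0.05  # daily sample rate stated in the text


def daily_qa_sample(clip_ids: Sequence[str],
                    rate: float = QA_SAMPLE_RATE,
                    seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of the day's delivered clips for review."""
    if not clip_ids:
        return []
    k = max(1, round(len(clip_ids) * rate))  # always review at least one clip
    return sorted(random.Random(seed).sample(list(clip_ids), k))


def classes_needing_audit(iaa_by_class: Dict[str, float],
                          baseline: float,
                          tolerance: float = 0.05) -> List[str]:
    """Flag classes whose agreement fell more than `tolerance` below baseline."""
    return sorted(c for c, iaa in iaa_by_class.items() if baseline - iaa > tolerance)


delivered = [f"clip_{i:03d}" for i in range(200)]
sample = daily_qa_sample(delivered)
to_audit = classes_needing_audit({"pick": 0.87, "pour": 0.78}, baseline=0.86)
```

Seeding the sampler makes the daily selection reproducible, so a disputed QA result can be traced back to exactly the clips that were reviewed.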

Engineering time freed

The headline outcome the client cared about was not the annotation cost. It was the engineering hours that did not have to be spent labeling. Based on the client's estimated baseline pace before the engagement, this work would have absorbed roughly 3,000 engineer-hours had it stayed in-house, or about three engineer-hours per hour of video. Those hours went back to model training and eval infrastructure.

Results

Video annotated: 1,000 hours
Action segment IAA (kappa): 0.86
Object tracking F1: 0.91
October readiness: Hit, with a 50-hour buffer

What made it work

  1. The expensive resource in robotics training data is not annotation. It is engineering time. Moving annotation off the R&D team paid back more than it cost in dollar terms, before any quality discussion.
  2. Multi-view annotation cannot be parallelized as independent per-camera passes. Spatial labels require a shared timeline and cross-view linking from day one, or the labels do not reconcile.
  3. Designing the schema against the downstream VLA model format avoided a reformatting pass at delivery, so the labels fed training directly. That reformatting pass is the most common avoidable cost in robotics annotation engagements.
  4. Overweighting edge cases in the pilot week caught schema problems that random sampling would have let through. Easy examples calibrate nothing useful.

References

Published research that informed the labeling schema and workflow.

  1. Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. · Google DeepMind
  2. Padalkar, A. et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. · Open X-Embodiment Collaboration
  3. Khazatsky, A. et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. · RSS 2024
  4. Octo Model Team (2024). Octo: An Open-Source Generalist Robot Policy. · Berkeley AI Research
  5. Driess, D. et al. (2023). PaLM-E: An Embodied Multimodal Language Model. · ICML 2023
