Robotics · Imitation Learning · Action Annotation · Multi-Camera

Why Robotics Datasets Need More Than Bounding Boxes

Bounding boxes annotate what is in a frame. Robotics policies need to know what is happening between frames. RT-1, BC-Z, and Open X-Embodiment all hinge on a more demanding kind of label.

2025-04 · 7 min read

If you have ever opened a robotics dataset and found a folder of MP4s with timestamps but no phase labels, you know the gap. Vision models can learn from boxed objects. Robotics policies cannot. The arm needs to know not just what is in the frame but what the human in the demonstration was trying to do at frame 247 versus frame 312.

The papers behind modern manipulation policies make this clear. RT-1 (Brohan et al., 2022) trains on roughly 130K demonstrations with discretized action tokens at each timestep. BC-Z (Jang et al., 2021) conditions its policy on a task description, either a language instruction or a human video, paired with each demonstration. Open X-Embodiment (Padalkar et al., 2023) only works because dozens of labs agreed to convert their data into a common format.

Phase boundaries are the unit of work

Manipulation tasks decompose into phases. Approach the object. Attempt grasp. Confirm grasp. Transport. Release. Retreat. Each phase has a beginning and end you can see in the video. Annotating those boundaries is the difference between a policy that imitates motion and a policy that imitates intent.

The hard part is consistency. Different annotators draw boundaries at slightly different frames. Three-rater consensus with a three-frame tolerance window works in practice. With a looser window, disagreement compounds quickly because errors at phase boundaries cascade through the rest of the labels.
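The consensus rule can be sketched in a few lines. This is a minimal illustration, assuming each annotator marks one frame index per boundary; taking the median as the consensus frame and the function name `consensus_boundary` are assumptions made for the example, not a prescribed method.

```python
from statistics import median

def consensus_boundary(frames, tolerance=3):
    """Return (consensus_frame, agreed) for one phase boundary.

    frames: frame indices marked by each annotator (hypothetical input format).
    tolerance: maximum allowed distance, in frames, from the consensus frame.
    """
    if len(frames) < 3:
        raise ValueError("expected at least three raters")
    # Assumption: the median of the raters' marks is used as the consensus.
    center = int(median(frames))
    agreed = all(abs(f - center) <= tolerance for f in frames)
    return center, agreed

# Three raters mark the same approach-to-grasp boundary.
print(consensus_boundary([247, 248, 250]))  # (248, True)
print(consensus_boundary([247, 248, 256]))  # (248, False)
```

A boundary that comes back with `agreed` set to `False` would go to adjudication rather than into the dataset.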

Single-camera annotation misses things

Wrist cameras see what the gripper sees. Overhead cameras see the layout. Third-person cameras see the whole scene. A slip event during a grasp can be invisible from above and obvious from the wrist. Failure modes that look identical from one angle are clearly different from another.

Synchronized multi-camera annotation tooling is not a nice-to-have for contact-rich tasks. It catches failures that single-camera review misses, and those missed failures are exactly the ones the policy ends up reproducing.
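At its simplest, synchronization means showing the annotator the frame from each camera that is closest in time to the moment under review. The sketch below assumes every camera provides a sorted list of frame timestamps in seconds on a shared clock. The camera names, the `max_skew` threshold, and both function names are hypothetical.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Index of the timestamp closest to t in a sorted, non-empty list."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def synced_frames(cameras, t, max_skew=0.02):
    """Map camera name -> index of the frame nearest to time t.

    A camera whose nearest frame is more than max_skew seconds away maps to
    None, so the reviewer sees a gap instead of a misleading frame.
    """
    out = {}
    for name, stamps in cameras.items():
        i = nearest_index(stamps, t)
        out[name] = i if abs(stamps[i] - t) <= max_skew else None
    return out

cameras = {
    "wrist": [0.000, 0.033, 0.067, 0.100],
    "overhead": [0.010, 0.043, 0.077, 0.110],
    "third_person": [0.000, 0.100],  # this camera dropped frames
}
print(synced_frames(cameras, t=0.067))
# {'wrist': 2, 'overhead': 2, 'third_person': None}
```

Returning `None` for a camera with no frame close enough is a deliberate choice here: a stale frame from one angle can hide exactly the slip the other angle shows.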

Failures are training data

Binary success or failure labels are the worst form of ground truth. They tell the policy what happened but not why. Failure mode taxonomies (slip, knock-over, mis-grasp, drop, collision, timeout) convert failures from noise into structured training signal.

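Mandlekar et al. (2021) studied what matters in learning from offline human demonstrations and found that the composition and quality of the dataset strongly affect the resulting policy. Which failures are in the dataset is part of that composition. Curated failure data, properly labeled, is sometimes more useful than additional successes.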

Beyond the failure mode itself, a useful failure label captures:

  • Recovery vs. terminal failure distinction
  • Object pose at moment of failure
  • Causal hypothesis from the annotator (slip vs. collision vs. plan error)
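A failure label along these lines might be represented as a small record. The sketch below is illustrative only: the class and field names are hypothetical, and the seven-value position-plus-quaternion pose layout is an assumption. The taxonomy values come from the list of failure modes above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class FailureMode(Enum):
    SLIP = "slip"
    KNOCK_OVER = "knock_over"
    MIS_GRASP = "mis_grasp"
    DROP = "drop"
    COLLISION = "collision"
    TIMEOUT = "timeout"

@dataclass(frozen=True)
class FailureLabel:
    episode_id: str
    frame: int                     # frame at which the failure occurs
    mode: FailureMode
    recovered: bool                # True = robot recovered, False = terminal
    # Object pose at the moment of failure: (x, y, z, qx, qy, qz, qw).
    # This layout is an assumption made for the example.
    object_pose: Optional[Tuple[float, ...]] = None
    causal_hypothesis: str = ""    # annotator's free-text guess at the cause

    def __post_init__(self):
        if self.frame < 0:
            raise ValueError("frame must be non-negative")
        if self.object_pose is not None and len(self.object_pose) != 7:
            raise ValueError("object_pose must have 7 values")

label = FailureLabel(
    episode_id="ep_0001",
    frame=312,
    mode=FailureMode.SLIP,
    recovered=False,
    object_pose=(0.42, -0.10, 0.05, 0.0, 0.0, 0.0, 1.0),
    causal_hypothesis="grasp too close to the object's edge",
)
print(label.mode.value, "recovered" if label.recovered else "terminal")
# slip terminal
```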

Gripper state, but verified

Most robot platforms log gripper open/close state via a binary sensor. The sensor is correct most of the time. When it is wrong, it is wrong silently. Annotators reviewing video catch the cases where the gripper closed but the object was not actually grasped, or where the gripper opened during transport without an explicit release command.

Cross-checking sensor logs against video annotation is the kind of QA pass that nobody plans for and everybody needs. We saw a project where 1,200 episodes had calibration drift on the force sensor that only surfaced because annotators flagged the mismatch.
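One form this cross-check can take is a per-frame comparison between the logged gripper state and the annotator's judgment. The sketch below covers only the "closed but nothing grasped" case and assumes both streams are already aligned to the same frames as lists of booleans. That format and the function name are hypothetical, and real logs usually need resampling first.

```python
def gripper_mismatches(sensor_closed, video_grasped):
    """Return inclusive frame ranges where the sensor reports 'closed'
    but the annotator marked no object in the gripper.

    sensor_closed, video_grasped: per-frame booleans of equal length.
    """
    if len(sensor_closed) != len(video_grasped):
        raise ValueError("streams must be aligned to the same frames")
    ranges, start = [], None
    for i, (closed, grasped) in enumerate(zip(sensor_closed, video_grasped)):
        mismatch = closed and not grasped
        if mismatch and start is None:
            start = i                      # a mismatch run begins
        elif not mismatch and start is not None:
            ranges.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        ranges.append((start, len(sensor_closed) - 1))
    return ranges

sensor = [False, True, True, True, True, False]
video = [False, False, False, True, True, False]
print(gripper_mismatches(sensor, video))  # [(1, 2)]
```

Short mismatch runs around the moment of closing are expected. Long ones, or the same pattern repeating across many episodes, are the cases worth escalating.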

The shape of useful robotics data

The minimum useful annotation for a manipulation dataset is: frame-level phase boundaries at 30 fps, 3D object pose at keyframes, gripper state aligned with sensor logs, and failure mode labels for unsuccessful episodes. Anything less and the policy ends up memorizing motion. Anything more and the annotation cost stops paying back.
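That minimum can be written down as a checklist that runs against every episode. The sketch below is one possible shape for such a check. The record layout, the field names, and the `missing_fields` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class EpisodeAnnotation:
    episode_id: str
    fps: int
    success: bool
    # (phase name, first frame, last frame), with inclusive frame indices.
    phases: List[Tuple[str, int, int]] = field(default_factory=list)
    # keyframe index -> object pose (x, y, z, qx, qy, qz, qw); layout assumed.
    object_poses: Dict[int, Tuple[float, ...]] = field(default_factory=dict)
    gripper_verified: bool = False  # sensor log cross-checked against video
    failure_mode: Optional[str] = None

def missing_fields(ann: EpisodeAnnotation) -> List[str]:
    """List which parts of the minimum annotation are absent."""
    missing = []
    if ann.fps < 30:
        missing.append("30 fps phase granularity")
    if not ann.phases:
        missing.append("phase boundaries")
    if not ann.object_poses:
        missing.append("object pose at keyframes")
    if not ann.gripper_verified:
        missing.append("gripper state aligned with sensor logs")
    # Failure mode labels are only required for unsuccessful episodes.
    if not ann.success and ann.failure_mode is None:
        missing.append("failure mode label")
    return missing

ann = EpisodeAnnotation(
    episode_id="ep_0002",
    fps=30,
    success=False,
    phases=[("approach", 0, 246), ("attempt_grasp", 247, 311)],
    gripper_verified=True,
)
print(missing_fields(ann))
# ['object pose at keyframes', 'failure mode label']
```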

References

  1. Brohan, A. et al. (2022). RT-1. · RSS 2023
  2. Padalkar, A. et al. (2023). Open X-Embodiment. · Open X-Embodiment Collaboration
  3. Jang, E. et al. (2021). BC-Z. · CoRL 2021
  4. Mandlekar, A. et al. (2021). What Matters in Learning from Offline Human Demonstrations. · CoRL 2021
