Generative Video / Image Quality Assessment
Subjective Video Quality Scoring at 98% Agreement for a Generative Video Model Team
A cross-continental rater pool across the USA, UK, India, and Bangladesh scored 16,000 model-generated videos on noise, sharpness, exposure, color, and overall quality. Pairwise preferences across A/B variants, along with free-text reasoning, fed reward modeling. Because the brief was subjective, the 98% agreement target was the hard part.
Client: Generative video model team
Volume: 16,000 videos, ≈80,000 A/B pairs, ≈640,000 metric judgments
Duration: 12 weeks
Team: 60 trained raters across the USA, UK, India, and Bangladesh; 8 senior reviewers; 2 calibration leads
Languages: English
The challenge
The client was training and evaluating a generative video model. Quality was not a single objective metric. It was five subjective metrics layered together: noise, sharpness, exposure, color, and an overall quality score. Each metric needed per-video ratings on a 1 to 5 scale plus pairwise preferences between A/B model variants for the same prompt.
Subjective ratings at scale are hard to standardize. Sharpness and noise can correlate, because denoising softens detail. Exposure judgments depend on display calibration. Color perception drifts by region and culture. Overall quality is informed by the four sub-metrics but does not always equal their average. The client wanted clean reward modeling signal, which meant the raters had to behave like one coherent system, not 60 independent judges.
Hitting 98% agreement on a subjective task is unusual. Most subjective video quality work in the literature reports kappa in the 0.6 to 0.8 range. Kappa is chance-corrected, so it is not directly comparable to raw percent agreement, but it indicates how much rater disagreement is typical. The client needed reproducible scoring that the model could learn from without picking up rater-specific bias. The brief was demanding, and it required a different annotation design than a standard image labeling project.
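The two statistics mentioned above, raw percent agreement and kappa, measure different things, and a short sketch may help show why they are not interchangeable. The case study does not state which statistic the 98% figure refers to or how it was computed; the functions and the sample scores below are illustrative assumptions, not the project's actual tooling or data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same score."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both raters pick the same score
    # if each drew independently from their own score distribution.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical scores from two raters on ten videos (1 to 5 scale).
rater_a = [5, 4, 4, 3, 5, 2, 4, 4, 3, 5]
rater_b = [5, 4, 4, 3, 5, 2, 4, 3, 3, 5]

print(percent_agreement(rater_a, rater_b))        # 0.9
print(round(cohens_kappa(rater_a, rater_b), 3))   # 0.861
```

In this toy sample the raters agree on 9 of 10 videos, yet kappa is lower than 0.9 because some of that agreement would be expected by chance.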
Our approach
Five-metric rating schema
Every video was rated on the client's five quality dimensions, each on a 1 to 5 scale with anchored definitions written into the guideline. The overall score followed published anchor language so it stayed comparable across raters and across regions.
- Noise: low-level artifacts, grain, compression blocks, temporal flicker
- Sharpness: edge clarity, focus, detail preservation against denoising softness
- Exposure: brightness range, clipping, dark crush, scene-appropriate dynamic range
- Color: saturation, white balance, hue accuracy against scene context
- Overall: 5 excellent, 4 minor imperfections, 3 visible but usable, 2 multiple clear issues, 1 severely degraded
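A rating record for the five-metric schema above could be sketched as follows. The class name, field names, and identifiers are hypothetical; the source only specifies the five metrics and the 1 to 5 scale.

```python
from dataclasses import dataclass

METRICS = ("noise", "sharpness", "exposure", "color", "overall")

@dataclass(frozen=True)
class VideoRating:
    """One rater's scores for one video (hypothetical record layout)."""
    video_id: str
    rater_id: str
    noise: int
    sharpness: int
    exposure: int
    color: int
    overall: int

    def __post_init__(self):
        # Reject anything outside the anchored 1 to 5 scale.
        for metric in METRICS:
            score = getattr(self, metric)
            if not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(f"{metric} must be an integer from 1 to 5, got {score!r}")

    def sub_metric_mean(self) -> float:
        """Mean of the four sub-metrics, kept separate from the overall score."""
        return (self.noise + self.sharpness + self.exposure + self.color) / 4

rating = VideoRating("vid_0001", "rater_17",
                     noise=4, sharpness=3, exposure=5, color=4, overall=3)
print(rating.sub_metric_mean(), rating.overall)  # 4.0 3
```

Storing the overall score as its own field, rather than deriving it, reflects the point that overall quality is informed by the sub-metrics but does not always equal their average.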
Pairwise A/B preference per metric
Beyond single-video ratings, raters compared A and B variants from the same generative prompt and chose a preferred variant per metric. This pairwise mode is what the client's reward model consumed. The methodology drew on RLHF-style preference annotation as used in InstructGPT (Ouyang et al., 2022) and recent generative-video preference work in VBench (Huang et al., 2024).
Reasoning capture was offered as an optional free-text field. About 40% of pairs came back with reasoning. The optional structure was intentional. Required reasoning produces shallow text that adds noise. Optional reasoning gets used by raters who genuinely have something to say, and those notes turned out to be the most valuable downstream signal.
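As a rough illustration of how a per-metric preference with optional reasoning might feed a reward model, the sketch below pairs a hypothetical record type with the pairwise loss used for reward modeling in InstructGPT, -log(sigmoid(r_chosen - r_rejected)). The source does not say which loss or data format the client used, so every name here is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class PairwisePreference:
    """One rater's A/B choice for one metric (hypothetical record layout)."""
    prompt_id: str
    metric: str                       # e.g. "noise", "sharpness", "overall"
    preferred: str                    # "A" or "B"
    reasoning: Optional[str] = None   # free text, left empty when not provided

    def chosen_and_rejected(self):
        """Return (chosen_variant, rejected_variant) for training."""
        if self.preferred not in ("A", "B"):
            raise ValueError("preferred must be 'A' or 'B'")
        return ("A", "B") if self.preferred == "A" else ("B", "A")

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

pref = PairwisePreference("prompt_042", "sharpness", "B",
                          reasoning="A loses fine texture after denoising")
print(pref.chosen_and_rejected())   # ('B', 'A')
```

The loss shrinks as the reward model scores the chosen variant above the rejected one, which is the signal pairwise annotation is meant to provide.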
Cross-continental rater pool
The 60-rater pool was deliberately distributed across the USA, UK, India, and Bangladesh. Color perception and viewing-condition norms vary by region, and a single-country rater pool would have produced a model that worked well for that region and drifted elsewhere. Distributing raters across four countries forced the consensus to cover the diversity the client's production traffic would see.
Each region had its own calibration sub-lead who knew the local conventions. Disagreements that traced to regional perception (saturated reds preferred in one market, more muted in another) were surfaced explicitly rather than averaged away, and the client used those signals to inform downstream model behavior.
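One way to surface regional disagreement instead of averaging it away is to compare per-region means for each video and flag large gaps. This is a minimal sketch; the function name, the one-point threshold, and the sample scores are assumptions and not taken from the project.

```python
from collections import defaultdict
from statistics import mean

def regional_gaps(ratings, threshold=1.0):
    """Flag videos whose per-region mean scores differ by more than threshold.

    ratings: iterable of (video_id, region, score) tuples for a single metric.
    Returns {video_id: {region: mean_score}} for the flagged videos only.
    """
    by_video = defaultdict(lambda: defaultdict(list))
    for video_id, region, score in ratings:
        by_video[video_id][region].append(score)

    flagged = {}
    for video_id, regions in by_video.items():
        means = {region: mean(scores) for region, scores in regions.items()}
        if len(means) > 1 and max(means.values()) - min(means.values()) > threshold:
            flagged[video_id] = means
    return flagged

# Hypothetical color scores.
color_scores = [
    ("v1", "USA", 4), ("v1", "USA", 4), ("v1", "India", 2), ("v1", "India", 3),
    ("v2", "UK", 4), ("v2", "Bangladesh", 4),
]
flagged = regional_gaps(color_scores)
print(flagged)  # only v1 is flagged: USA mean 4, India mean 2.5
```

A pooled mean for v1 would hide the 1.5-point gap between regions; keeping the per-region means makes it visible for review.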
Calibration anchors and pilot
The pilot week ran 400 anchor videos through every rater. Anchors had pre-determined target ratings authored by the client's image quality team. Raters whose scores drifted more than one step from anchors were retrained before joining production. Anchors were re-sampled every week during production to catch drift over time.
- Per-metric calibration thresholds with retraining trigger at >1 step drift
- Weekly anchor sampling on 5% of in-flight work
- Display calibration check at workstation setup
- Reference videos in the guideline for each 1 to 5 rating
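The retraining trigger and weekly sampling described above can be sketched roughly as below. The data layout and function names are hypothetical, and the sampling helper reflects only one possible reading of "5% of in-flight work" (selecting 5% of in-flight items for an anchor check).

```python
import random

def raters_needing_retraining(anchor_targets, rater_scores, max_drift=1):
    """Find raters whose anchor scores drift more than max_drift steps.

    anchor_targets: {(video_id, metric): target_score}
    rater_scores:   {rater_id: {(video_id, metric): score}}
    Returns {rater_id: [(video_id, metric), ...]} listing the missed anchors.
    """
    flagged = {}
    for rater_id, scores in rater_scores.items():
        misses = sorted(
            key for key, score in scores.items()
            if key in anchor_targets and abs(score - anchor_targets[key]) > max_drift
        )
        if misses:
            flagged[rater_id] = misses
    return flagged

def weekly_anchor_sample(in_flight_ids, rate=0.05, seed=None):
    """Pick a random subset of in-flight items for an anchor check."""
    ids = list(in_flight_ids)
    if not ids:
        return []
    k = min(len(ids), max(1, round(len(ids) * rate)))
    return random.Random(seed).sample(ids, k)

# Hypothetical anchors and rater scores.
targets = {("a1", "noise"): 4, ("a1", "color"): 3}
scores = {
    "rater_01": {("a1", "noise"): 3, ("a1", "color"): 3},  # within one step
    "rater_02": {("a1", "noise"): 2, ("a1", "color"): 3},  # noise off by two
}
print(raters_needing_retraining(targets, scores))
# {'rater_02': [('a1', 'noise')]}
```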
How we reached 98% agreement
Subjective scoring at 98% inter-rater agreement did not come from any single fix. It came from stacking several design choices:
- Anchored definitions per metric instead of free-text descriptions
- Pairwise preferences as the primary annotation mode, with single-video ratings as a secondary signal cross-checked against the pairwise results
- Multi-region calibration to surface and handle regional bias instead of averaging it away
- Senior reviewer adjudication on any pair where regional sub-leads disagreed
- Iterative guideline revision driven by the optional reasoning corpus, which exposed where the schema was unclear
- Hard rejection of raters who failed weekly anchor re-checks
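The cross-check between single-video ratings and pairwise preferences listed above could look something like the following. How the project actually reconciled the two signals is not described, so the function, data layout, and the decision to treat tied scores as non-conflicts are all assumptions.

```python
def pairwise_rating_conflicts(single_scores, preferences):
    """List preferences that contradict the same rater's single-video scores.

    single_scores: {(prompt_id, variant, metric): score}, variant is "A" or "B"
    preferences:   iterable of (prompt_id, metric, preferred_variant)
    A conflict means the preferred variant received a strictly lower score.
    Ties are not conflicts, since a pairwise choice can break a tie.
    """
    conflicts = []
    for prompt_id, metric, preferred in preferences:
        other = "B" if preferred == "A" else "A"
        s_pref = single_scores.get((prompt_id, preferred, metric))
        s_other = single_scores.get((prompt_id, other, metric))
        if s_pref is None or s_other is None:
            continue  # nothing to compare against
        if s_pref < s_other:
            conflicts.append((prompt_id, metric, preferred))
    return conflicts

# Hypothetical ratings from one rater.
single = {
    ("p1", "A", "noise"): 4, ("p1", "B", "noise"): 2,
    ("p2", "A", "color"): 3, ("p2", "B", "color"): 3,
}
prefs = [("p1", "noise", "B"), ("p2", "color", "A")]
print(pairwise_rating_conflicts(single, prefs))
# [('p1', 'noise', 'B')]
```

Flagged items are candidates for adjudication or guideline review, not automatic errors.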
Results
- Inter-rater agreement: 98%
- Videos rated: 16,000
- Pairwise judgments: ≈80,000
- Rater countries: USA, UK, India, Bangladesh
What made it work
1. Anchored rating definitions matter more than the scale itself. A 1 to 5 scale with vague descriptors produces low agreement. The same scale with reference videos and explicit per-point criteria converges fast.
2. Multi-region distribution surfaces bias instead of hiding it. Averaging single-region ratings gives the illusion of agreement and produces a model that fails outside that region. Cross-region consensus forces the schema to handle perceptual variance.
3. Optional reasoning beats required reasoning on subjective tasks. Required reasoning produces filler. Optional reasoning gets used where it matters and becomes useful training signal in its own right.
4. Pairwise preference is more stable than single-video rating for reward modeling. The client used single ratings for evaluation dashboards and pairwise preferences for training, which matched the literature on RLHF preference data quality.
References
Published research that informed the labeling schema and workflow.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.
- Huang, Z. et al. (2024). VBench: Comprehensive Benchmark Suite for Video Generative Models. CVPR 2024.
- Zhang, R. et al. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (LPIPS). CVPR 2018.
- Mittal, A., Moorthy, A. K., Bovik, A. C. (2012). No-Reference Image Quality Assessment in the Spatial Domain (BRISQUE). IEEE TIP.
- Wang, J. et al. (2023). Exploring CLIP for Assessing the Look and Feel of Images (CLIP-IQA). AAAI 2023.