Sony AI has released the Fair Human‑Centric Image Benchmark (FHIBE), a publicly available, consent‑based image dataset and evaluation suite aimed at surfacing and diagnosing bias across a wide range of computer‑vision tasks. Launched alongside a paper in Nature, FHIBE is positioned as a practical example of building an ethically minded dataset — and as a corrective to the many large image collections assembled by web scraping without subject consent.
What FHIBE is and why it matters
FHIBE (pronunciations in coverage have varied) contains 10,318 images of 1,981 unique people from 81 countries and territories. The set was assembled specifically for fairness evaluation rather than model training: images come with rich, mostly self‑reported metadata (pronouns, age, ancestry categories, Fitzpatrick skin tones and more) plus pixel‑level labels — face and person bounding boxes, 33 keypoints and 28 segmentation categories — intended to support tasks from pose estimation and person detection to face verification and visual question answering.
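To give a sense of what each record carries, here is a purely illustrative sketch of a FHIBE‑style annotation: the field names, types and values are this article's invention and do not reflect the dataset's actual schema or file format; they only mirror the kinds of labels described above (self‑reported metadata plus pixel‑level annotations).

```python
# Hypothetical sketch of a FHIBE-style annotation record.
# Field names and values are illustrative only; they do not reflect
# the dataset's actual schema or file format.
from dataclasses import dataclass, field

@dataclass
class SubjectMetadata:
    pronouns: str          # self-reported, e.g. "she/her"
    age_range: str         # self-reported, e.g. "30-39"
    ancestry: list[str]    # self-reported ancestry categories
    skin_tone: int         # Fitzpatrick scale, 1-6

@dataclass
class ImageAnnotation:
    image_id: str
    subject: SubjectMetadata
    face_bbox: tuple[float, float, float, float]    # x, y, width, height
    person_bbox: tuple[float, float, float, float]
    keypoints: list[tuple[float, float]] = field(default_factory=list)  # up to 33 keypoints
    segmentation: dict[str, list] = field(default_factory=dict)         # up to 28 categories

record = ImageAnnotation(
    image_id="example_0001",
    subject=SubjectMetadata(pronouns="she/her", age_range="30-39",
                            ancestry=["East Asian"], skin_tone=3),
    face_bbox=(120.0, 80.0, 96.0, 96.0),
    person_bbox=(90.0, 60.0, 220.0, 480.0),
)
```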
Sony says the dataset fills gaps left by existing benchmarks: unlike many popular corpora, FHIBE was collected with documented informed consent, paid participants, GDPR‑style privacy safeguards and an explicit mechanism that lets subjects revoke consent and have their images removed. The dataset and its access portal are available for registered users at FHIBE, with terms that limit use to fairness evaluation and mitigation (the dataset is not permitted for general model training).
A design built around consent, compensation and privacy
FHIBE’s developers emphasize several practical choices intended to reduce harm and increase transparency:
- Consent and withdrawal: every subject provided informed consent; subjects may request removal of their images and the dataset maintainers commit to replacing withdrawn images where possible.
- Compensation: vendors were required to pay at least local minimum wages; authors report the median compensation for image subjects was roughly 12× the applicable minimum wage. The project also discloses the substantial cost of the effort: Sony collected 28,703 candidate images to yield the final 10,318-image release, with per‑image vendor payments and additional fixed costs for QA, legal and platform work amounting to several hundred thousand dollars.
- Privacy measures: incidental people and personally identifiable items detected in imagery were redacted using a text‑guided diffusion inpainting pipeline and then manually checked (a sketch of the general technique follows this list). Images were also screened against known child‑abuse image hashes and for prominent third‑party intellectual property. EXIF metadata were stripped or coarsened, and some sensitive attributes (for example, disability, pregnancy, height and weight) are reported only in aggregate.
- Annotation practices: many demographic and contextual labels are self‑reported by subjects; objective image labels (keypoints, segmentation) were produced by trained annotators and QA teams. Annotator IDs (anonymized) and, where disclosed, self‑reported annotator demographics are included to improve transparency.
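The redaction step above relies on text‑guided diffusion inpainting. The snippet below is a minimal sketch of that general technique using the open‑source Hugging Face `diffusers` library; it is not Sony's pipeline, and the checkpoint, file names and prompt are illustrative assumptions (the paper's approach also includes manual review).

```python
# Minimal sketch of text-guided diffusion inpainting for privacy redaction.
# This is NOT Sony's pipeline; it only illustrates the general technique.
# Assumes a binary mask covering the region to redact (e.g. an incidental
# bystander) has already been produced by a detector or human annotator.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",   # example checkpoint, an assumption
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("street_scene.jpg").convert("RGB")   # hypothetical input photo
mask = Image.open("bystander_mask.png").convert("L")    # white = region to replace

# A neutral text prompt steers the inpainted content toward innocuous background.
redacted = pipe(
    prompt="empty sidewalk, plain background",
    image=image,
    mask_image=mask,
).images[0]

redacted.save("street_scene_redacted.jpg")   # would still need manual review
```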
What the benchmark found: confirming and revealing biases
Using FHIBE to evaluate a broad set of pretrained models, Sony’s team confirmed known fairness problems and exposed new patterns:
- Narrow (task‑specific) models: across eight tasks — including pose estimation, person segmentation, person detection, face detection and face verification — the largest performance gaps occurred at intersections of attributes (pronoun × age × ancestry × skin tone). Common findings included better performance for younger people, lighter skin tones and Asian ancestry groups, while older people, darker skin tones and African ancestry groups more often fell into lower‑performing subsets. But results varied by task and model, underscoring the dataset’s value for granular diagnostics (a sketch of this kind of intersectional slicing appears after this section).
- New, actionable insights: for face detection, FHIBE’s annotations helped trace pronoun‑related disparities to concrete image features: ‘no visible hair’ (often baldness) made detection harder for some `he/him` images, while certain headwear that preserved facial contours made detection easier in some `she/her` images. Face parsing models struggled more with older subjects and individuals with grey or white facial hair, and face verification errors were linked to greater hairstyle variability among `she/her` subjects.
- Foundation (vision‑language) models: FHIBE was used to probe CLIP and BLIP‑2. CLIP assigned gender‑neutral labels more often to `he/him` images than to `she/her` images and associated certain ancestries with specific scenes (for example, rural or outdoor settings). BLIP‑2’s open‑ended VQA outputs exhibited troubling stereotypes: neutral prompts produced occupation inferences that sometimes invoked sex‑worker or criminal labels for particular ancestry or pronoun groups, and negatively framed prompts elicited toxic responses more often for people of African or Asian ancestry, darker skin tones and some `he/him` groups.
These outcomes illustrate the kinds of intersectional problems FHIBE is designed to expose and the value of dense, self‑reported annotations in tracing sources of error beyond crude demographic proxies.
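As a concrete illustration of that kind of slicing, the sketch below groups hypothetical per‑image evaluation results by intersections of self‑reported attributes and compares subgroup error rates. The file name, column names and subgroup‑size threshold are this article's assumptions, not FHIBE's actual evaluation tooling.

```python
# Sketch of intersectional disparity analysis over per-image results.
# `results.csv` and its columns are hypothetical: one row per image with the
# model's outcome plus the subject's self-reported attributes.
import pandas as pd

df = pd.read_csv("results.csv")  # columns: correct, pronouns, age_range, ancestry, skin_tone

# Error rate per intersectional subgroup (pronoun x age x ancestry x skin tone).
groups = df.groupby(["pronouns", "age_range", "ancestry", "skin_tone"])
summary = groups["correct"].agg(error_rate=lambda s: 1.0 - s.mean(), n="size")

# Ignore tiny subgroups, then report the gap between best- and worst-served groups.
summary = summary[summary["n"] >= 30].sort_values("error_rate")
gap = summary["error_rate"].max() - summary["error_rate"].min()
print(summary.head(), summary.tail(), f"max subgroup gap: {gap:.3f}", sep="\n\n")
```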
Limits, trade‑offs and costs
The FHIBE team is explicit about trade‑offs and limitations:
- Expense and scale: ethically collected, richly annotated datasets are expensive and time‑consuming. The project reports vendor image acquisition costs (roughly US$308,500 for the candidate images) plus nearly US$450,000 in fixed costs for QA, legal review and platform development, as well as years of staff time.
- Visual diversity vs. web‑scraped scale: compared with huge web‑scraped collections, FHIBE is smaller and, by design, more visually constrained. Subjects tended to be closer to the camera and more centred, which can make certain tasks easier for models; the authors note that models sometimes perform better on FHIBE than on more chaotic, scraped benchmarks.
- Fraud and verification challenges: crowdsourced, paid collection raises the risk of fraudulent submissions (people uploading images they don’t own or didn’t take). The project removed thousands of suspect images after automated and manual checks (including web‑search checks); excluded images were disproportionately from some demographic groups, a side effect that complicates collection strategies aimed at diversity.
- Use restrictions: to preserve FHIBE’s value as an evaluation benchmark and reduce harms, the dataset license forbids use for general model training (with a narrow exception for bias mitigation tools). Users must register and accept terms to obtain access.
How researchers and industry can use FHIBE
FHIBE is intended primarily as an evaluation and debugging tool for developers and researchers who want to understand how their models perform across many demographic and contextual axes. Its contributions include:
- Intersectional diagnosis: detailed, self‑reported demographic labels plus pixel‑level annotations make it possible to test and explain disparities at intersectional granularities that many datasets lack.
- Foundation model auditing: the dataset’s real‑world images (not synthetic probes) can be used to probe VLMs for harmful stereotypes and spurious associations (see the sketch after this list).
- A template for responsible data curation: FHIBE documents a workflow for consent, compensation, privacy redaction, QA and consent revocation — practices other groups can adapt.
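To make that auditing use concrete, here is a minimal sketch of a zero‑shot label probe with CLIP via Hugging Face `transformers`. It is a generic probe, not the paper's protocol; the checkpoint, the candidate captions and the example file name are assumptions.

```python
# Minimal sketch of auditing a vision-language model's label associations.
# Generic CLIP zero-shot probe, not the paper's evaluation protocol; the
# checkpoint and candidate captions below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a man", "a photo of a woman", "a photo of a person"]
image = Image.open("fhibe_example.jpg")   # hypothetical consented benchmark image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # shape: (1, num_labels)
probs = logits.softmax(dim=-1).squeeze(0)

# Aggregating these probabilities across demographic subgroups (e.g. by
# self-reported pronouns) is what surfaces skewed associations.
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```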
Researchers and practitioners can request access at the project portal: FHIBE.
Implications and next steps
FHIBE demonstrates that it is technically and legally feasible to build a consented, richly annotated fairness benchmark. At the same time, the project highlights important systemic tensions: ethical collection at scale is costly; paying to recruit under‑represented groups can create incentives for fraud; and restricting dataset use to evaluation reduces some harms but limits downstream research choices. The authors hope their documentation and cost disclosures will inform funders and the research community about the true expense of ethical data curation and encourage investment in scalable, rights‑respecting methods.
For model builders, FHIBE provides actionable tools to go beyond headline fairness numbers and probe the causes of errors. For policymakers and advocates, it is evidence that informed consent, compensation and privacy safeguards can be operationalized in practice — if organizations commit resources to do so.
As AI systems increasingly interact with people in sensitive contexts, benchmarks like FHIBE make visible the mismatch between current training corpora and the ethical standards many researchers and regulators now expect. FHIBE does not solve algorithmic bias by itself, but it supplies a practical, consented dataset and a set of analyses that can help engineers, auditors and policymakers identify where models go wrong and why.