Phare LLM Benchmark

Fiscal Host: Open Source Europe

Phare is a multilingual benchmark to evaluate LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, potential harm, and jailbreak resistance.

Contribute


Become a financial contributor.

Financial Contributions

Donation (custom contribution)
Make a custom one-time or recurring contribution.

Bronze: €1,000 EUR / month (recurring)
Get your company logo on our website and the right to use the Phare logo on your own website as a donor.

Silver: €5,000 EUR / month (recurring)
Bronze benefits, plus a joint social media post and a thank-you mention in our next newsletter after your first donation.

€10,000 EUR / month (recurring)
Silver benefits, plus a joint blog article, press release, and webinar to announce our collaboration.

€100,000 EUR (one-time)
Support extending the Phare LLM Benchmark to one new language, with culturally grounded prompts and native annotators.

€200,000 EUR (one-time)
Support the addition of a new module to the Phare LLM Benchmark, focused on evaluating a new task or category of AI safety/security risk.

Phare LLM Benchmark is all of us

Our contributors (2)

Thank you for supporting Phare LLM Benchmark.


News from Phare LLM Benchmark

Updates on our activities and progress.

Phare LLM benchmark V2 (December 2025)

What's new in Phare V2
Phare V2 introduces a major update: a jailbreak module focused on circumventing safety guardrails to enable the generation of harmful content; and the inclusion of r...
Published on December 24, 2025 by Alexandre Combessie

📰 Latest Update: Benchmarking 17 LLMs with Phare (June 2025)

We recently released the first large-scale evaluation using Phare, testing 17 leading language models across our three core safety modules: hallucination, bias & stereotypes, and harmful content generat...
Published on June 16, 2025 by Stanislas Renondin

About


🧠 About Phare 
 
Phare (Potential Harm Assessment & Risk Evaluation) is an open, multilingual benchmark for evaluating the safety of large language models (LLMs). Developed independently by Giskard and open to contributions from the research community, Phare provides a transparent, reproducible, and culturally inclusive assessment framework. Our aim is to build a public infrastructure that supports the responsible deployment of LLMs in society. 
 
Phare evaluates models such as GPT-5, Claude, Gemini and open-source alternatives on four major safety dimensions: hallucinations, bias and fairness, harmful content generation, and jailbreak resistance.

Phare is not a paid service: it is freely accessible for research and non-commercial use, and we welcome community contributions and donations to support its continued development.
 
🧪 Methodology & Structure 
 
Phare uses a modular evaluation architecture, with each module corresponding to a distinct risk category. Within each module, we define a set of tasks, each containing multiple prompt samples. These prompts are tested against language models and scored using a dedicated benchmark runner framework, Flare. All benchmark metrics are computed by comparing model outputs against explicit scoring criteria. 
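
To make this concrete, here is a minimal sketch of that module → task → sample hierarchy in Python. It is purely illustrative: the class and field names are invented for this example and are not Flare's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """One test case: a prompt paired with explicit scoring criteria."""
    prompt: str
    language: str        # e.g. "en", "fr", "es"
    criteria: list[str]  # what a satisfactory model response must do

@dataclass
class Task:
    """A task groups samples that probe one behavior (e.g. Misinformation)."""
    name: str
    samples: list[Sample] = field(default_factory=list)

@dataclass
class Module:
    """A module corresponds to one risk category (e.g. Hallucination)."""
    name: str
    tasks: list[Task] = field(default_factory=list)

hallucination = Module(
    name="Hallucination",
    tasks=[
        Task(
            name="Misinformation",
            samples=[
                Sample(
                    prompt="Is it true that we only use 10% of our brains?",
                    language="en",
                    criteria=["The response identifies the claim as a myth."],
                ),
            ],
        ),
    ],
)
```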
 

Current Modules


1. Hallucination
- Tasks: Factuality, Misinformation, Debunking, Tools Reliability
- Samples: ~6000 private, ~2800 public
- Focus: Measures issues with factual reliability, misinformation, and generation of false or misleading information.

2. Harmfulness
- Task: Harmful Misguidance
- Samples: ~1500 private, ~400 public
- Focus: Probes whether the model can generate content or advice that can expose individuals to harm or enable harmful behavior.

3. Bias & Fairness
- Task: Self-assessed Stereotypes
- Samples: ~2400 private, ~600 public
- Focus: Measures issues with fairness and stereotype amplification across demographic groups.

4. Jailbreaks & Intentional Abuse
- Tasks: Encoding jailbreaks, Framing jailbreaks, Prompt injection
- Samples: ~3600 private, ~1000 public
- Focus: Measures whether the models are vulnerable to known attacks, such as prompt injection, request framing, and encoding.
 
🔧 Sample Creation Process

We employ a three-step process to collect samples for each task. First, we gather content. This involves collecting source materials in English, French, and Spanish, and developing seed prompts that reflect real-world usage scenarios. Next, we create evaluation samples. We transform the gathered content into test cases, ensuring cultural and linguistic authenticity. These samples cover four key assessment categories: hallucination, bias, security, and harmful content generation. Finally, we implement quality control measures. Each sample undergoes human review for accuracy and relevance.

This process yields a set of test cases, each pairing a prompt with specific evaluation criteria. During assessment, we collect model responses to these prompts and score them against the defined criteria to generate benchmark metrics.

Evaluation is performed with Flare, an open-source framework for running the benchmark on language models. The full evaluation pipeline is available on GitHub.
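
As an illustration of what such a pipeline computes, the sketch below scores the toy module defined earlier: it collects a response for every prompt and reports, per task, the fraction of samples whose responses meet all of their criteria. The grader is a deliberately naive placeholder (real criteria checks are task-specific), `model_fn` stands in for a call to the LLM under test, and none of this is Flare's actual interface.

```python
from typing import Callable

def meets_criteria(response: str, criteria: list[str]) -> bool:
    # Naive placeholder grader: accept any non-empty response.
    # In the real benchmark, each criterion gets a dedicated check.
    return all(bool(response.strip()) for _ in criteria)

def run_module(model_fn: Callable[[str], str], module: Module) -> dict[str, float]:
    """Per-task pass rate: the fraction of samples whose responses
    satisfy all of their scoring criteria."""
    results: dict[str, float] = {}
    for task in module.tasks:
        passed = sum(
            meets_criteria(model_fn(sample.prompt), sample.criteria)
            for sample in task.samples
        )
        results[task.name] = passed / len(task.samples)
    return results

# Example with a stand-in model:
print(run_module(lambda prompt: "That is a myth; scans show wide brain activity.",
                 hallucination))
# -> {'Misinformation': 1.0}
```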

 

📂 Resources



🌍 Why Open Collective?

Phare is being built as a public-good infrastructure. We believe that safety evaluations should not be owned by tech giants, but by communities that care about fairness, transparency, and societal impact. Open Collective gives us the tools to fund Phare transparently and to work with contributors from around the world.
Whether you’re a researcher, company, policymaker, or citizen, your support helps us expand Phare to new domains, new languages, and new safety modules.

 

Our team