Phare LLM Benchmark

Phare is an open, independent & multilingual benchmark to evaluate LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, and potential harm.

Contribute


Become a financial contributor.

Financial Contributions

Recurring contribution: €1,000 EUR / month

Get your company logo on our website and the right to use the Phare logo on your website as a donor.

Recurring contribution: €5,000 EUR / month

Bronze benefits + a joint social media post and a thank-you mention in our next newsletter after the first donation.

Recurring contribution: €10,000 EUR / month

Silver benefits + a joint blog article, press release, and webinar to announce our collaboration.

One-time contribution: €100,000 EUR

Support extending the Phare LLM Benchmark to one new language, with culturally grounded prompts and native annotators.

One-time contribution: €200,000 EUR

Support the addition of a new module to the Phare LLM Benchmark, focused on evaluating a new task or category of AI safety/security risk.
Custom contribution
Donation
Make a custom one-time or recurring contribution.

Phare LLM Benchmark is all of us

Our contributors: 2

Thank you for supporting Phare LLM Benchmark.

Connect


Let’s get the ball rolling!

News from Phare LLM Benchmark

Updates on our activities and progress.

📰 Latest Update: Benchmarking 17 LLMs with Phare (June 2025)

We recently released the first large-scale evaluation using Phare, testing 17 leading language models across our three core safety modules: hallucination, bias & stereotypes, and harmful content generation.
Published on June 16, 2025 by Stanislas Renondin

About


🧠 About Phare 
 
Phare (Potential Harm Assessment & Risk Evaluation) is an open, multilingual benchmark for evaluating the safety of large language models (LLMs). Developed independently by Giskard and open to contributions from the research community, Phare provides a transparent, reproducible, and culturally inclusive assessment framework. Our aim is to build a public infrastructure that supports the responsible deployment of LLMs in society. 
 
Phare evaluates models such as GPT-4, Claude, Gemini and open-source alternatives on three major safety dimensions: hallucinations, bias and fairness, and harmful content generation. A fourth dimension, adversarial misuse and robustness, is under active development. 

Phare is not a paid service — it is freely accessible for research and non-commercial use, and we welcome community contributions and donations to support its continued development.
 
🧪 Methodology & Structure 
 
Phare uses a modular evaluation architecture, with each module corresponding to a distinct risk category. Within each module, we define a set of tasks, each containing multiple prompt samples. These prompts are tested against language models and scored using a dedicated evaluation framework, LMEval. All benchmark metrics are computed by comparing model outputs against explicit scoring criteria. 
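To make this structure concrete, here is a minimal sketch of how the module → task → sample hierarchy could be represented in Python. The class and field names are illustrative assumptions, not Phare's or LMEval's actual data model:

from dataclasses import dataclass, field

@dataclass
class Sample:
    prompt: str              # prompt sent to the model under test
    criteria: list[str]      # explicit scoring criteria for this prompt
    language: str = "en"     # Phare currently sources content in en / fr / es

@dataclass
class Task:
    name: str                                        # e.g. "Factuality" or "Debunking"
    samples: list[Sample] = field(default_factory=list)

@dataclass
class Module:
    name: str                                        # e.g. "Hallucination"
    tasks: list[Task] = field(default_factory=list)

def module_size(module: Module) -> int:
    """Total number of prompt samples contained in a module."""
    return sum(len(task.samples) for task in module.tasks)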
 

Current Modules


1. Hallucination
   - Tasks: Factuality, Misinformation, Debunking, Tools Reliability
   - Samples: ~6000 private, ~1600 public
   - Focus: Factual reliability and misleading information, including tool-based generation (e.g. RAG).
2. Harmfulness
   - Tasks: Harmful Misguidance
   - Samples: ~1500 private, ~400 public
   - Focus: Generation of unsafe or dangerous advice, including unauthorized medical or legal information.
3. Bias & Fairness
   - Tasks: Self-assessed Stereotypes
   - Samples: ~2400 private, ~600 public
   - Focus: Discrimination and stereotype reinforcement across demographic groups and languages.
 
🔧 Sample Creation Process

We follow a three-step protocol to develop benchmark samples:
 
1. Content Sourcing
We collect material in English, French, and Spanish and craft seed prompts that reflect real-world usage scenarios.
2. Evaluation Sample Creation
We transform the content into testable prompts, ensuring linguistic and cultural authenticity. Each sample is matched with evaluation criteria.
3. Quality Control
All samples are reviewed by human annotators for relevance, clarity, and safety. We ensure coverage across four key categories: hallucination, bias, harmful content, and adversarial misuse.
 
Evaluation is performed using LMEval, an open-source framework. We will soon release our full evaluation pipeline to ensure full reproducibility.
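As a purely illustrative example (this is not LMEval's API), scoring in this style reduces to checking each model output against its sample's explicit criteria and aggregating per-sample results into a benchmark metric:

# Illustrative sketch only: a toy scorer that assumes criteria can be checked
# as simple substring matches. Phare's real pipeline relies on LMEval and
# richer, per-sample evaluation criteria.
def score_output(output: str, criteria: list[str]) -> float:
    """Fraction of a sample's criteria satisfied by one model output."""
    if not criteria:
        return 0.0
    met = sum(1 for criterion in criteria if criterion.lower() in output.lower())
    return met / len(criteria)

def benchmark_metric(per_sample_scores: list[float]) -> float:
    """Aggregate per-sample scores into a module-level metric."""
    return sum(per_sample_scores) / len(per_sample_scores) if per_sample_scores else 0.0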
 

🌍 Why Open Collective?

Phare is being built as a public-good infrastructure. We believe that safety evaluations should not be owned by tech giants, but by communities that care about fairness, transparency, and societal impact. Open Collective gives us the tools to fund Phare transparently and to work with contributors from around the world.
Whether you’re a researcher, company, policymaker, or citizen, your support helps us expand Phare to new domains, new languages, and new safety modules.

 

Our team