Phare LLM Benchmark
Phare is a multilingual benchmark to evaluate LLMs across key safety & security dimensions, including hallucination, factual accuracy, bias, potential harm, and jailbreak resistance.
Contribute
Become a financial contributor.
Financial Contributions
- Get your company logo on our website and the right to use the Phare logo on your website as a donor.
- Bronze benefits plus a joint social media post and a thank-you mention in our next newsletter after the first donation.
- Silver benefits plus a joint blog article, press release, and webinar to announce our collaboration.
- Support extending the Phare LLM Benchmark to one new language, with culturally grounded prompts and native annotators.
- Support the addition of a new module to the Phare LLM Benchmark, focused on evaluating a new task or category of AI safety / security risk.
Phare LLM Benchmark is all of us
Our contributors
Thank you for supporting Phare LLM Benchmark.
News from Phare LLM Benchmark
Updates on our activities and progress.
Phare LLM Benchmark V2 (December 2025)
📰 Latest Update: Benchmarking 17 LLMs with Phare (June 2025)
About
Phare is not a paid service; it is freely accessible for research and non-commercial use, and we welcome community contributions and donations to support its continued development.
Current Modules
1. Hallucination
- Samples: ~6000 private, ~2800 public
- Focus: Measures issues with factual reliability, misinformation, and generation of false or misleading information.
2. Harmfulness
- Task: Harmful Misguidance
- Samples: ~1500 private, ~400 public
- Focus: Probes whether the model generates content or advice that could expose individuals to harm or enable harmful behavior.
3. Bias & Fairness
- Task: Self-assessed Stereotypes
- Samples: ~2400 private, ~600 public
- Focus: Measures issues with fairness and stereotype amplification across demographic groups.
4. Jailbreaks & Intentional Abuse
- Tasks: Encoding jailbreaks, Framing jailbreaks, Prompt injection
- Samples: ~3600 private, ~1000 public
- Focus: Measures whether models are vulnerable to known attacks such as prompt injection, request framing, and encoding.
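For readers who want to script against this breakdown, here is a minimal sketch of the module composition as plain Python data. The counts are the approximate figures listed above; the structure and field names are illustrative only, not Phare's actual schema.

```python
# Approximate composition of the Phare benchmark, taken from the module list above.
# Structure and field names are illustrative only, not Phare's actual schema.
PHARE_MODULES = {
    "hallucination": {
        "samples": {"private": 6_000, "public": 2_800},
    },
    "harmfulness": {
        "tasks": ["harmful_misguidance"],
        "samples": {"private": 1_500, "public": 400},
    },
    "bias_and_fairness": {
        "tasks": ["self_assessed_stereotypes"],
        "samples": {"private": 2_400, "public": 600},
    },
    "jailbreaks_and_intentional_abuse": {
        "tasks": ["encoding_jailbreaks", "framing_jailbreaks", "prompt_injection"],
        "samples": {"private": 3_600, "public": 1_000},
    },
}

# Roughly 13,500 private and 4,800 public samples in total.
totals = {
    split: sum(m["samples"][split] for m in PHARE_MODULES.values())
    for split in ("private", "public")
}
print(totals)  # {'private': 13500, 'public': 4800}
```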
🔧 Sample Creation Process
We employ a three-step process to collect samples for each task. First, we gather content: we collect source materials in English, French, and Spanish and develop seed prompts that reflect real-world usage scenarios. Next, we create evaluation samples: we transform the gathered content into test cases, ensuring cultural and linguistic authenticity. These samples cover four key assessment categories: hallucination, bias, security, and harmful content generation. Finally, we apply quality control: each sample undergoes human review for accuracy and relevance.
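As a rough illustration of this three-step flow, here is a minimal sketch in Python. All names (gather_content, build_samples, human_review, Sample) are hypothetical stand-ins, not part of Phare's actual tooling.

```python
# Illustrative sketch of the three-step sample creation flow described above.
# All names here are hypothetical stand-ins, not Phare's actual tooling.
from dataclasses import dataclass
from typing import Callable

LANGUAGES = ["en", "fr", "es"]
CATEGORIES = ["hallucination", "bias", "security", "harmful_content"]

@dataclass
class Sample:
    prompt: str
    language: str
    category: str

def gather_content(language: str) -> list[str]:
    """Step 1: collect source material and seed prompts for one language."""
    # In practice this is manual curation of real-world usage scenarios.
    return [f"seed prompt in {language}"]

def build_samples(seeds: list[str], language: str, category: str) -> list[Sample]:
    """Step 2: turn seed content into culturally grounded test cases."""
    return [Sample(prompt=seed, language=language, category=category) for seed in seeds]

def human_review(samples: list[Sample], approve: Callable[[Sample], bool]) -> list[Sample]:
    """Step 3: keep only samples that pass human review for accuracy and relevance."""
    return [s for s in samples if approve(s)]

# Wiring the three steps together:
dataset: list[Sample] = []
for language in LANGUAGES:
    seeds = gather_content(language)
    for category in CATEGORIES:
        candidates = build_samples(seeds, language, category)
        dataset.extend(human_review(candidates, approve=lambda s: bool(s.prompt)))
```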
Evaluation is performed with Flare, an open-source framework to run the benchmark evaluation on language models. The full evaluation pipeline is available on GitHub.
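For intuition only, a bare-bones evaluation loop over such samples could look like the sketch below. This is not Flare's API; model_answer and passes_check are hypothetical stand-ins for the model under test and a task-specific grading rule.

```python
# Bare-bones evaluation loop, for intuition only (not Flare's actual API).
from typing import Callable

def evaluate(samples: list[dict],
             model_answer: Callable[[str], str],
             passes_check: Callable[[dict, str], bool]) -> float:
    """Return the fraction of samples whose model answer passes the task-specific check."""
    if not samples:
        return 0.0
    passed = sum(passes_check(s, model_answer(s["prompt"])) for s in samples)
    return passed / len(samples)

# Example with trivial stand-ins:
demo = [{"prompt": "Is the Earth flat?", "expected": "not flat"}]
score = evaluate(
    demo,
    model_answer=lambda prompt: "No, the Earth is not flat.",
    passes_check=lambda sample, answer: sample["expected"] in answer.lower(),
)
print(f"pass rate: {score:.2f}")  # pass rate: 1.00
```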
📂 Resources
🌍 Why Open Collective?
Phare is being built as public-good infrastructure. We believe that safety evaluations should not be owned by tech giants, but by communities that care about fairness, transparency, and societal impact. Open Collective gives us the tools to fund Phare transparently and to work with contributors from around the world.
Whether you’re a researcher, company, policymaker, or citizen, your support helps us expand Phare to new domains, new languages, and new safety modules.
Our team
David Berenstein
Alexandre Com...