Evaluating the Generative Capabilities of LLMs in Russian.
Benchmark and a family of LM-as-a-Judge models.
Welcome to POLLUX – an open-source project dedicated to evaluating the generative capabilities of modern large language models (LLMs) in Russian.
Our evaluation framework is built on three foundational pillars. First, we provide carefully developed 📊 taxonomies that systematically categorize both generative tasks and evaluation criteria. Second, our meticulously crafted 🌟 benchmark comprises 2,100 unique, manually created instructions paired with 471,515 detailed criterion-level assessments. Finally, POLLUX features a specialized ⚖️ family of LLM-based judges that automate the evaluation process, enabling scalable and systematic assessment of model outputs across all task categories.
↗ 🧭 Explore the benchmark on the project page.
↗ 🤗 See Hugging Face collection for the dataset and the models.
- 📚 152 diverse tasks: Covering open-ended generation, text-to-text transformation, information-seeking, and code-related prompts. The task taxonomy is grounded in an analysis of real-world user queries.
- 🌡️ 66 unique evaluation criteria: A rich set of non-overlapping, fine-grained metrics, ranging from surface-level quality (e.g. absence of artifacts) to higher-level abilities such as reasoning and creativity. Each criterion comes with a clearly defined evaluation scale.
- 📊 Three difficulty levels: Tasks are organized into easy, medium, and hard tiers to support targeted model diagnostics.
- 👩🏼🎓 Expert-curated tasks: All tasks and criteria are designed from scratch by domain experts to ensure quality and relevance. All instructions and criteria annotations are likewise developed and reviewed by expert panels to maintain consistent standards throughout the evaluation process.
- 🤖 LLM-based evaluators: A suite of judge models (7B and 32B) trained to assess responses against specific criteria and to generate score justifications. Custom criteria and evaluation scales are supported via flexible input formatting (beta).
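For instance, a custom criterion and scale can be packaged into a single evaluation request. The sketch below shows one plausible way to assemble such a request; the actual prompt template used by the POLLUX judges is defined in src/score.py, and the field layout here is an assumption for illustration only.

```python
# Illustrative sketch of packing a custom criterion and scale into a judge
# prompt. The real template lives in src/score.py; this layout is assumed.

def build_judge_prompt(instruction: str, answer: str,
                       criterion: str, scale: dict) -> str:
    """Assemble a single evaluation request for an LLM-as-a-Judge model."""
    scale_lines = "\n".join(
        f"{score}: {meaning}" for score, meaning in sorted(scale.items())
    )
    return (
        f"Instruction:\n{instruction}\n\n"
        f"Model answer:\n{answer}\n\n"
        f"Criterion: {criterion}\n"
        f"Scale:\n{scale_lines}\n"
        "Return a score and a short justification."
    )

prompt = build_judge_prompt(
    "Напишите хокку о весне.",
    "Тает снег...",
    "Absence of artifacts",
    {0: "severe artifacts", 1: "minor artifacts", 2: "no artifacts"},
)
```

The assembled string would then be sent to the judge as an ordinary chat completion request.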
Score model outputs with POLLUX judges: demo.ipynb
To get scores for a custom model, run one of the code variants below; answers will be saved in the model's folder under results/. The maximum score is 2.
1. Clone and install

```shell
git clone https://github.com/ai-forever/POLLUX.git
cd POLLUX
pip install -r requirements.txt
```

Use any OpenAI-compatible endpoint (e.g. a local server or OpenAI). Set OPENAI_API_KEY or pass --api-key and --base-url.
2. Generate model answers
```shell
vllm serve <model_name> --port 8000
```

```shell
python src/answer.py \
    --split train \
    --model <model_name> \
    --backend openai \
    --api-key NONE \
    --base-url http://localhost:8000/v1 \
    --max-tokens 1024 \
    --temperature 0.5 \
    --concurrency 100
```

3. Score answers with POLLUX judge
```shell
vllm serve <judge_model> --port 8888
```

```shell
python src/score.py <model_name> \
    --split train \
    --backend openai \
    --judge-model <judge_model (ai-forever/pollux-judge-7b or ai-forever/pollux-judge-32b)> \
    --api-key NONE \
    --base-url http://localhost:8888/v1 \
    --max-tokens 1024 \
    --temperature 0.1 \
    --concurrency 100
```

4. Compute metrics
```shell
python src/metrics.py <model_name> --split train
```

Alternatively, run inference and judging locally with vLLM. No API key or base URL is required.
2. Generate model answers
```shell
python src/answer.py \
    --split train \
    --model <model_name> \
    --backend vllm \
    --max-tokens 1024 \
    --temperature 0.5 \
    --tensor-parallel-size 1
```

3. Score answers with POLLUX judge
```shell
python src/score.py <model_name> \
    --split train \
    --backend vllm \
    --judge-model <judge_model (ai-forever/pollux-judge-7b or ai-forever/pollux-judge-32b)> \
    --max-tokens 1024 \
    --temperature 0.1 \
    --tensor-parallel-size 1
```

4. Compute metrics
```shell
python src/metrics.py <model_name> --split train
```

Repository layout:

```
POLLUX/
├── images/                # Project logos
├── metainfo/              # Benchmark metadata
├── clustering_demo.ipynb  # User logs analysis
├── src/                   # Inference tools
│   ├── answer.py          # Generate model answers
│   ├── score.py           # Run POLLUX judges
│   ├── metrics.py         # Aggregate metrics
│   └── inference.py       # Full evaluation pipeline
├── LICENSE
└── demo.ipynb             # Quick inference demo
```
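Conceptually, the metrics step reduces per-criterion judge scores (0 to 2) to averages per task group. The sketch below illustrates that aggregation on toy records; the actual results/ schema consumed by src/metrics.py may differ, and the field names here are assumptions.

```python
# Illustrative aggregation of per-criterion judge scores into a mean score
# per task group. Record layout ("task_group", "score") is assumed, not the
# actual schema written under results/.
from collections import defaultdict

def aggregate(records: list) -> dict:
    """Mean judge score per task group; each record carries one score."""
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec["task_group"]].append(rec["score"])
    return {group: sum(s) / len(s) for group, s in by_group.items()}

demo = [
    {"task_group": "Summarization", "score": 2},
    {"task_group": "Summarization", "score": 1},
    {"task_group": "Code", "score": 2},
]
means = aggregate(demo)  # {'Summarization': 1.5, 'Code': 2.0}
```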
The POLLUX benchmark is built upon comprehensive taxonomies of generative tasks and evaluation criteria. Our taxonomy of generative tasks encompasses 35 general task groups organized across two hierarchical levels (functional styles/substyles and genres), covering a total of 152 distinct tasks. 📊
Our taxonomy of evaluation criteria features five categories that assess:
- 🔍 General & Critical (two categories): Core syntactic, lexical, and semantic text properties
- 🎯 Domain-specific: Properties tied to specialized functional styles
- ✅ Task-specific: Task-oriented markers and requirements
- 💭 Subjective: Human preferences and subjective opinions
▎📈 Benchmark Scale & Coverage
The benchmark contains 2,100 unique instructions evenly distributed across all 35 task groups, with three complexity levels per group. Each instruction includes responses from 7 top-tier LLMs:
- 🤖 OpenAI o1 & GPT-4o
- 🧠 Claude 3.5 Sonnet
- 🦙 Llama 405B
- ⚡️ T-pro-it-1.0
- 🔍 YandexGPT 4 Pro
- 💎 GigaChat Max
This results in 11,500 total responses across the benchmark! 🚀
▎🔬 Expert Evaluation Process
Every response is rigorously evaluated using a tailored criteria set that combines:
- Critical, Subjective, and General criteria
- Relevant Domain- and Task-specific criteria
With at least two expert evaluators per criterion, we've collected:
- 471,000+ individual criteria estimates with textual rationales ✍️
- 161,076 numerical scores aggregated over annotator overlap 📊
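As an illustration of aggregation over annotator overlap: each criterion is scored by at least two experts, and the overlapping estimates are reduced to a single number. The sketch below uses the mean; the benchmark's actual aggregation rule may differ.

```python
# Illustrative reduction of overlapping expert estimates (>= 2 per criterion)
# to one aggregate score. The mean is an assumption, not the benchmark's
# documented aggregation rule.
from statistics import mean

def aggregate_overlap(estimates: list) -> float:
    """Collapse the overlapping annotator estimates for one criterion."""
    assert len(estimates) >= 2, "each criterion has at least two annotators"
    return mean(estimates)

score = aggregate_overlap([2, 1])  # -> 1.5
```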
▎🌐 Access & Exploration
Ready to dive in? Access the benchmark on its home page and explore the data through our interactive demo! 🎮
POLLUX includes a family of LLM-based judges, trained to evaluate model outputs against scale-based criteria. The judges are designed to be flexible and can be adapted to different evaluation scales and criteria.
We provide two versions of the judges:
- 7B (T-lite-based): A smaller model that is faster and more efficient, suitable for quick evaluations and lower resource environments.
- 32B (T-pro-based): A larger model that provides more accurate evaluations, suitable for high-performance environments.
There are two architecture types in both sizes:
- seq2seq: A sequence-to-sequence model that generates a score and its justification in a decoder-only manner as a joint text output.
- regression (-r in HF model identifiers): A regression model that outputs a numeric score from an added regression head and generates the score justification in a decoder-only manner.
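For the seq2seq variant, the numeric score has to be parsed out of the generated text. The sketch below shows one way to do that, assuming a "Score: N" convention; the actual output format of pollux-judge-7b/32b is fixed by its training data and may differ.

```python
# Illustrative parser for a seq2seq judge's joint text output. The
# "Score: N" convention below is an assumption for this sketch; the real
# judges' output format may differ.
import re

def parse_judge_output(text):
    """Split a judge completion into (score, justification)."""
    match = re.search(r"Score:\s*([0-2])", text)
    if match is None:
        return None, text.strip()  # no score found; keep full text
    return int(match.group(1)), text[match.end():].strip()

parsed = parse_judge_output("Score: 2\nThe answer is fluent and artifact-free.")
```

The regression (-r) variant sidesteps this parsing entirely, since the score is read directly from the regression head.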
This project is licensed under the MIT License. See LICENSE for details.
If you use POLLUX in your research, please cite the following paper:
```bibtex
@misc{martynov2025eyejudgementdissectingevaluation,
  title = {Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX},
  author = {Nikita Martynov and Anastasia Mordasheva and Dmitriy Gorbetskiy and Danil Astafurov and Ulyana Isaeva and Elina Basyrova and Sergey Skachkov and Victoria Berestova and Nikolay Ivanov and Valeriia Zanina and Alena Fenogenova},
  year = {2025},
  eprint = {2505.24616},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2505.24616}
}
```

Made with ❤️ by the POLLUX team