All Reports

Detailed reports are organized by track. Use the sections below for fast navigation.

Reasoning | Systems | Coding | Benchmark Timeline

Reasoning (7)

Section average: 7.89 / 10

Sarvam AI Web Reasoning Benchmark

Strong logic and math performance, with weaker handling of overlapping constraints.

Score 9.00 / 10 | 15 questions | Run 1

Sarvam AI Multi-Layer Nested Hypothetical Benchmark

Very strong layered reasoning and scope discipline with minor hierarchy-depth gaps.

Score 8.80 / 10 | 5 questions | Run 9

Sarvam AI Partial Spec Corruption Benchmark

Strong contradiction detection with moderate depth limits in enforcement-level explanation.

Score 8.50 / 10 | 10 cases | Run 7

Sarvam AI Long-Context Spec Consistency Benchmark

Strong long-context retention and rule cross-referencing with moderate consistency-depth gaps.

Score 8.20 / 10 | 9 questions | Run 4

Sarvam AI Silent Inconsistency Injection Benchmark

Good silent contradiction detection, with weaker real-time enforceability and resource-bound analysis.

Score 7.80 / 10 | 10 cases | Run 8

Sarvam AI Adversarial Spec Mutation Benchmark

Good mutation tracking, but weak enforceability modeling under distributed constraints.

Score 7.10 / 10 | 10 questions | Run 6

Sarvam AI Multi-Step Logic Stress Benchmark

Solid linear reasoning but weak global validation and constraint reconciliation.

Score 5.80 / 10 | 5 questions | Run 5

Systems (1)

Section average: 6.00 / 10

Sarvam AI Deep Systems Engineering Benchmark

Good architecture framing, but inconsistent consensus depth and quantitative rigor.

Score 6.00 / 10 | 15 questions | Run 3

Coding (2)

Section average: 8.46 / 10

Sarvam AI Python Coding Benchmark

Senior-level algorithmic performance in Python with minor edge-case weaknesses.

Score 9.03 / 10 | 15 questions | Run 2

Sarvam AI Cross-Language Coding Stress Benchmark

Strong paradigm switching across languages, but concurrency correctness is not fully production-grade.

Score 7.90 / 10 | 5 stages | Run 10

Benchmark Timeline

Run 1: Web Reasoning
Reasoning | Score 9.00 / 10 | 15 questions
Run 2: Python Coding
Coding | Score 9.03 / 10 | 15 questions
Run 3: Deep Systems
Systems | Score 6.00 / 10 | 15 questions
Run 4: Long-Context Spec
Reasoning | Score 8.20 / 10 | 9 questions
Run 5: Multi-Step Logic
Reasoning | Score 5.80 / 10 | 5 questions
Run 6: Spec Mutation
Reasoning | Score 7.10 / 10 | 10 questions
Run 7: Spec Corruption
Reasoning | Score 8.50 / 10 | 10 cases
Run 8: Silent Inconsistency
Reasoning | Score 7.80 / 10 | 10 cases
Run 9: Nested Hypothetical
Reasoning | Score 8.80 / 10 | 5 questions
Run 10: Cross-Language Coding
Coding | Score 7.90 / 10 | 5 stages
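
The section averages listed for each track can be reproduced from the timeline scores. The following is a minimal sketch, not the site's actual method: the names `RUNS` and `section_averages` are hypothetical, and the choice of an unweighted mean with round-half-even rounding is an assumption, made only because it matches the three averages shown on this page (7.89, 6.00, 8.46).

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_EVEN

# (run number, track, score out of 10), copied from the timeline above.
RUNS = [
    (1, "Reasoning", "9.00"),
    (2, "Coding", "9.03"),
    (3, "Systems", "6.00"),
    (4, "Reasoning", "8.20"),
    (5, "Reasoning", "5.80"),
    (6, "Reasoning", "7.10"),
    (7, "Reasoning", "8.50"),
    (8, "Reasoning", "7.80"),
    (9, "Reasoning", "8.80"),
    (10, "Coding", "7.90"),
]


def section_averages(runs):
    """Return the unweighted mean score per track, rounded to 2 decimals."""
    by_track = defaultdict(list)
    for _run, track, score in runs:
        # Decimal avoids binary floating-point surprises on values like 8.465.
        by_track[track].append(Decimal(score))
    return {
        track: (sum(scores) / len(scores)).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_EVEN  # assumed rounding mode
        )
        for track, scores in by_track.items()
    }


if __name__ == "__main__":
    for track, average in section_averages(RUNS).items():
        print(f"{track}: {average} / 10")
```

The Coding track is the only one where the rounding mode matters: its unrounded mean is exactly 8.465, which becomes 8.46 under round-half-even and would become 8.47 under round-half-up. Each benchmark counts equally here regardless of how many questions, cases, or stages it contains.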