Sarvam AI Long-Context Spec Consistency Benchmark | Sarvam AI

Scope

The test measures whether the model can retain and apply rules from a long distributed-system spec while answering follow-up questions.

Score

Overall: 8.2 / 10
Core Q1-Q7: 7.8 / 10
Hidden ordering paradox: 9 / 10
Failure plus consistency stress: 8.5 / 10
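
The report does not say how the overall score is derived from the three sub-scores. As a rough illustration only, the sketch below computes a weighted mean under the assumption of equal weights. The names `sub_scores`, `weighted_mean`, and `equal_weights` are hypothetical and not part of the benchmark.

```python
# Sub-scores as reported above (each out of 10).
sub_scores = {
    "Core Q1-Q7": 7.8,
    "Hidden ordering paradox": 9.0,
    "Failure plus consistency stress": 8.5,
}


def weighted_mean(scores, weights):
    """Return the weighted average of scores; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Assumption: equal weights. The report does not state its actual weighting.
equal_weights = {name: 1.0 for name in sub_scores}
unweighted = weighted_mean(sub_scores, equal_weights)
print(round(unweighted, 2))  # 8.43
```

An equal-weight mean comes out near 8.43 rather than the reported 8.2, so the overall figure presumably reflects a different weighting or a separate holistic judgment.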

Key Successes

Correct worker count recall
Correct exactly-once impossibility handling
Correct retry overflow interpretation
Good cross-section rule referencing

Main Gaps

Overconfident interpretation when the rule hierarchy is not explicit
Oversimplified timing analysis for strong consistency
Limited discovery of deeper contradictions

Verdict

A clear step up from the earlier logic benchmark, especially under formal specification framing.