Benchmark Status: This evaluation is based on the Sarvam AI web version only, from a single prompt run with no re-prompting. It has not been validated in real projects and remains incomplete until stable API access is available.

Scope

The test measures whether the model can retain and apply rules from a long distributed-systems specification while answering a series of follow-up questions.

Score

  • Overall: 8.2 / 10
  • Core Q1-Q7: 7.8 / 10
  • Hidden ordering paradox: 9 / 10
  • Failure and consistency stress: 8.5 / 10

Key Successes

  • Correct worker count recall
  • Correct exactly-once impossibility handling
  • Correct retry overflow interpretation
  • Good cross-section rule referencing
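For context on the exactly-once point above: exactly-once message delivery is impossible in a distributed system with failures, so systems typically pair at-least-once delivery with idempotent processing. A minimal sketch of that workaround, with illustrative names not taken from the evaluated spec:

```python
# Sketch: deduplicate on a message ID so that redelivered messages
# (inevitable under at-least-once delivery) are processed only once.
processed_ids = set()
results = []

def handle(message_id, payload):
    """Process a message at most once, even if delivered repeatedly."""
    if message_id in processed_ids:
        return  # duplicate delivery: skip re-processing
    processed_ids.add(message_id)
    results.append(payload)

# At-least-once delivery may replay message 1:
for mid, data in [(1, "a"), (2, "b"), (1, "a")]:
    handle(mid, data)

print(results)  # ['a', 'b'] — each payload processed exactly once
```

In production the deduplication set would live in durable storage and be updated atomically with the processing side effect; the in-memory set here only illustrates the idea.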

Main Gaps

  • Overconfident interpretation when hierarchy is not explicit
  • Simplified strong-consistency timing analysis
  • Limited deep contradiction discovery

Verdict

A clear step up from the earlier logic benchmark, especially under formal specification framing.