Scope
The test measures whether the model can retain and apply rules from a long distributed-system spec while answering follow-up questions.
Score
- Overall: 8.2 / 10
- Core Q1-Q7: 7.8 / 10
- Hidden ordering paradox: 9 / 10
- Failure plus consistency stress: 8.5 / 10
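The report does not say how the overall score is derived from the three sub-scores. The sketch below only checks the arithmetic: a plain average of the sub-scores comes to about 8.43 rather than 8.2, so some weighting was presumably applied. The 3:1:1 weights are a hypothetical illustration, not the report's stated method; they happen to land near the reported overall.

```python
from statistics import mean

# Sub-scores copied from the list above (each out of 10).
sub_scores = {
    "Core Q1-Q7": 7.8,
    "Hidden ordering paradox": 9.0,
    "Failure plus consistency stress": 8.5,
}

def unweighted_overall(scores):
    """Plain average of the sub-scores, rounded to two decimals."""
    return round(mean(scores.values()), 2)

def weighted_overall(scores, weights):
    """Weighted average of the sub-scores, rounded to two decimals."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return round(weighted_sum / total_weight, 2)

# ASSUMPTION: these weights are hypothetical; the report states none.
assumed_weights = {
    "Core Q1-Q7": 3,
    "Hidden ordering paradox": 1,
    "Failure plus consistency stress": 1,
}

plain = unweighted_overall(sub_scores)
weighted = weighted_overall(sub_scores, assumed_weights)
print(plain, weighted)
```

Under these assumed weights the core questions dominate, which pulls the result below the plain average and toward the reported 8.2. Other weightings could fit equally well.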
Key Successes
- Correct worker count recall
- Correct exactly-once impossibility handling
- Correct retry overflow interpretation
- Good cross-section rule referencing
Main Gaps
- Overconfident interpretation when hierarchy is not explicit
- Oversimplified timing analysis for strong consistency
- Limited discovery of deeper contradictions
Verdict
The model performs clearly better here than on the earlier logic benchmark, especially when the task is framed as a formal specification.