Scope
The benchmark covers 15 problems across logic, quantifiers, divisibility, combinatorics, probability, invariants, graph theory, and strategy puzzles.
Score
- Final: 135 / 150 (90%)
- Average: 9.0 / 10
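As a quick consistency check on the figures above, the short sketch below recomputes the average and percentage from the total. It assumes each of the 15 problems is scored out of 10, which is inferred from the 150-point maximum and the "/ 10" average rather than stated outright; the variable names are illustrative.

```python
# Figures taken from the report. The per-problem maximum of 10 is an
# assumption inferred from 150 / 15 and the "/ 10" average.
NUM_PROBLEMS = 15
MAX_PER_PROBLEM = 10
TOTAL_SCORE = 135

max_total = NUM_PROBLEMS * MAX_PER_PROBLEM   # maximum achievable score
average = TOTAL_SCORE / NUM_PROBLEMS         # mean score per problem
percent = 100 * TOTAL_SCORE / max_total      # share of the maximum
points_lost = max_total - TOTAL_SCORE        # total points dropped

print(f"{TOTAL_SCORE} / {max_total} = {percent:.0f}%")
print(f"average = {average:.1f} / {MAX_PER_PROBLEM}")
print(f"points lost = {points_lost}")
```

Under that assumption the total, the average, and the maximum agree with one another, and 15 points were dropped across the whole set.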
Where It Performed Best
- Probability and Bayes calculations
- Invariant and parity reasoning
- Single-constraint combinatorics
- Graph degree reasoning
Where It Slipped
- Multi-constraint counting cases
- Quantifier interaction edge cases
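To illustrate the kind of multi-constraint counting where overlaps are easy to mishandle, here is a minimal sketch that cross-checks an inclusion-exclusion count against brute-force enumeration. The problem it uses (integers from 1 to 100 divisible by 2, 3, or 5) is a hypothetical stand-in, not one of the benchmark items, and the function names are invented for this example.

```python
from itertools import combinations
from math import lcm  # multi-argument lcm requires Python 3.9+


def count_brute(n, divisors):
    """Count k in 1..n divisible by at least one divisor, by enumeration."""
    return sum(1 for k in range(1, n + 1) if any(k % d == 0 for d in divisors))


def count_inclusion_exclusion(n, divisors):
    """Same count via inclusion-exclusion over every non-empty subset."""
    total = 0
    for r in range(1, len(divisors) + 1):
        sign = 1 if r % 2 == 1 else -1  # add odd-sized overlaps, subtract even-sized
        for subset in combinations(divisors, r):
            total += sign * (n // lcm(*subset))
    return total


# Hypothetical example problem, not taken from the benchmark.
brute = count_brute(100, [2, 3, 5])
formula = count_inclusion_exclusion(100, [2, 3, 5])
print(brute, formula)
```

Both approaches give 74 here. Dropping any overlap term from the formula breaks the agreement, which is why comparing a closed-form count with an enumeration is a useful check when several constraints interact.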
Notes
| Item | Observation |
|---|---|
| Q2 | Weak handling of interacting quantifiers |
| Q9 | Multi-constraint counting error |
| Remaining set | Mostly high confidence and correct |
Verdict
A strong reasoning profile for web interaction use. The main gap is global constraint reconciliation when multiple rules overlap.