How the engine actually works
No magic, and no claim that we’ve beaten anyone yet. The engine proposes answers, places them, and retracts them when the crossings prove them wrong. What follows is how it does that, plus an honest account of where it stands against the best crossword AIs ever built.
The pipeline
- 01 · Propose. For each clue, a language model and a 245k-clue historical index produce a distribution over candidate answers — not one guess, a ranked field with probabilities.
- 02 · Place. A deterministic constraint engine owns the grid: forward-checking, MRV (minimum-remaining-values) ordering to fill the most constrained slot next, and conflict-directed retraction. It belongs to the same family as the grid-filling stages of Dr.Fill and the Berkeley Crossword Solver: reasoning over the whole grid, not a word-list lookup.
- 03 · Doubt & retract. When a crossing contradicts a placed answer, the engine un-places it and explains why. Every place and retract is recorded with the reasoning behind it.
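The three steps above can be sketched in a few lines. This is a minimal illustration, not the production engine: the `Slot` type, the `solve` function, the log tuple format, and the toy candidates are all hypothetical, and plain chronological backtracking stands in for conflict-directed retraction.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    name: str          # hypothetical label such as "1A"
    cells: list        # [(row, col), ...] in reading order
    candidates: list   # [(answer, probability), ...] from the propose step

def live(slot, grid):
    """Candidates that still agree with every letter already in the grid."""
    return [(a, p) for a, p in slot.candidates
            if len(a) == len(slot.cells)
            and all(grid.get(c, ch) == ch for c, ch in zip(slot.cells, a))]

def solve(slots, grid=None, log=None):
    grid = {} if grid is None else grid
    log = [] if log is None else log
    if not slots:
        return grid, log
    # MRV: take the slot with the fewest live candidates (ties: first listed).
    slot = min(slots, key=lambda s: len(live(s, grid)))
    rest = [s for s in slots if s is not slot]
    for answer, p in sorted(live(slot, grid), key=lambda ap: -ap[1]):
        trial = {**grid, **dict(zip(slot.cells, answer))}
        # Forward check: reject a placement that empties another slot's options.
        dead = next((s.name for s in rest if not live(s, trial)), None)
        if dead is not None:
            log.append(("skip", slot.name, answer, f"leaves {dead} with no candidates"))
            continue
        log.append(("place", slot.name, answer, f"p={p:.2f}"))
        result = solve(rest, trial, log)
        if result is not None:
            return result
        # Every continuation failed, so un-place the answer and record why.
        log.append(("retract", slot.name, answer, "crossings could not be completed"))
    return None

# Toy grid: 1A runs along the top row, 1D down the left column, 3A along the bottom row.
slots = [
    Slot("1A", [(0, 0), (0, 1), (0, 2)], [("CAT", 0.6), ("BAT", 0.4)]),
    Slot("1D", [(0, 0), (1, 0), (2, 0)], [("COG", 0.5), ("BUS", 0.5)]),
    Slot("3A", [(2, 0), (2, 1), (2, 2)], [("SUN", 0.7), ("SKY", 0.3)]),
]
grid, log = solve(slots)
for entry in log:
    print(entry)
```

In this toy run the higher-probability CAT is placed first, forces COG at 1D, and COG leaves 3A with no candidates, so CAT is retracted and BAT goes in instead. The log keeps that whole sequence, which is the kind of record a move-by-move replay would read from.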
Where we stand — honestly
Published numbers for the field, and ours next to them. We’ll update this as the benchmark moves.
| System | What it does | Published result |
|---|---|---|
| Berkeley Crossword Solver (2021) | Neural QA → loopy belief propagation → local-search correction | 82% of NYT puzzles solved perfectly; 99.9% letter accuracy on themeless puzzles. Hybridized with Dr.Fill, the first program to beat every human at the ACPT. The bar. |
| Dr.Fill (Ginsberg) | Heuristic search over a clue database | Near top-human for years; hybridized with Berkeley for the 2021 win. |
| OneLook | Pattern + dictionary/definition lookup | Excellent reference search — but no reasoning, no grid, no explanation. |
| Crossword Genius (Ross) | Cryptic clue explanation | Strong at explaining cryptic wordplay; single-clue, no full-grid reasoning. |
| Across & Down (ours) | LLM + historical distribution propose → constraint engine places → records reasoning per move | Clue-answer recall: ~50% top-1, ~67% top-20 on held-out historical clues. Full 15×15 solve: below Berkeley today. We’re building the same kind of belief-propagation step to close the gap. |
The honest version: on a fully filled hard grid we are not yet at Berkeley’s accuracy. We say so on purpose. What we have that none of them ship: a recorded, replayable chain of reasoning for every move.
What only we do
- Replayable reasoning. Every solve stores each placement and retraction with the why. You can scrub it move by move. Berkeley wins; it doesn’t show its work. We do.
- Honest uncertainty. Cryptic mode breaks down the definition and wordplay and tells you plainly what it can’t parse — a calibrated “not sure” beats a confident fabrication.
- Your puzzle, by photo. Snap the puzzle in your actual newspaper; it reads the grid and every clue and assists without spoiling.
- Power pattern search. Three operators: _ matches one letter, * matches any run of letters, and :meaning adds a semantic filter. Built to out-reach OneLook on the queries that matter mid-solve.
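As a rough illustration of the pattern syntax, the two letter operators map naturally onto a regular expression. The sketch below is an assumption about how such a query could be parsed, not the shipped search: `compile_pattern` and `search` are hypothetical names, `*` is assumed to allow an empty run, and the `:meaning` part is only split off and returned, since the semantic filter itself is not described here.

```python
import re

def compile_pattern(query):
    """Split 'pattern :meaning' and turn the pattern into a regex."""
    pattern, _, meaning = query.partition(":")
    parts = []
    for ch in pattern.strip():
        if ch == "_":
            parts.append("[A-Z]")        # exactly one letter
        elif ch == "*":
            parts.append("[A-Z]*")       # any run of letters (assumed: may be empty)
        else:
            parts.append(re.escape(ch.upper()))
    return re.compile("".join(parts)), (meaning.strip() or None)

def search(query, words):
    """Return the words matching the letter pattern, plus the unapplied meaning filter."""
    regex, meaning = compile_pattern(query)
    hits = [w for w in words if regex.fullmatch(w.upper())]
    return hits, meaning

words = ["cat", "cot", "cart", "coat", "dog"]
print(search("c_t", words))
print(search("c*t :animal", words))
```

Using `fullmatch` keeps the pattern anchored at both ends, so `c_t` cannot match in the middle of a longer word.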
The roadmap to the bar
- ✓ Shipped. Distributional clue-answering (the historical index, with 50× the top-1 recall of a length-plus-frequency baseline).
- → Building. Belief-propagation grid solving over those distributions, the step that took Berkeley to 99.9% letter accuracy on themeless puzzles.
- ○ Next. Self-correction pass; a modern clue bank for contemporary answers; a public daily benchmark on real published puzzles.
Methodology: recall measured on a puzzle-level held-out split of the public-domain NYT archive (≈450k clue→answer pairs). Berkeley figures from Wallace et al., ACL 2022. Brutal feedback welcome — roger@grubb.net.
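For readers who want to reproduce this kind of measurement, here is a minimal sketch of a puzzle-level split and a top-k recall score. The function names, the `puzzle_id` field, the 10% default test fraction, and the toy data are assumptions for illustration. The point of splitting by puzzle rather than by clue is that no clue from a held-out puzzle can appear in the training side.

```python
import random

def split_by_puzzle(pairs, test_fraction=0.1, seed=0):
    """Hold out whole puzzles, so train and test never share a puzzle."""
    puzzle_ids = sorted({p["puzzle_id"] for p in pairs})
    random.Random(seed).shuffle(puzzle_ids)
    n_test = max(1, int(len(puzzle_ids) * test_fraction))
    held_out = set(puzzle_ids[:n_test])
    train = [p for p in pairs if p["puzzle_id"] not in held_out]
    test = [p for p in pairs if p["puzzle_id"] in held_out]
    return train, test

def recall_at_k(ranked, gold, k):
    """Fraction of clues whose true answer is among the top k candidates."""
    hits = sum(1 for cands, answer in zip(ranked, gold) if answer in cands[:k])
    return hits / len(gold)

# Toy data: three clues, each with a ranked candidate list and a true answer.
ranked = [["AREA", "ACRE"], ["OREO", "OLEO"], ["ERA", "EON"]]
gold = ["AREA", "OLEO", "AGE"]
print(recall_at_k(ranked, gold, 1))
print(recall_at_k(ranked, gold, 2))

pairs = [{"puzzle_id": f"p{i}", "clue": "toy clue", "answer": "TOY"}
         for i in range(10) for _ in range(3)]
train, test = split_by_puzzle(pairs, test_fraction=0.2)
```

On the toy data, one of three answers is ranked first and two of three appear in the top two, so recall rises from 1/3 at k=1 to 2/3 at k=2.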