How the engine actually works

No magic, and no claim we’ve beaten anyone yet. This is a reasoning engine — it proposes answers, places them, and retracts them when the crossings prove it wrong — plus an honest account of where it stands against the best crossword AIs ever built.

The pipeline

01 · Propose. For each clue, a language model and a 245k-clue historical index produce a distribution over candidate answers — not one guess, a ranked field with probabilities.
02 · Place. A deterministic constraint engine owns the grid: forward-checking, MRV (minimum-remaining-values) to pick the most constrained slot next (the one with the fewest viable candidates left), and conflict-directed retraction. It is in the same family as Dr.Fill and the Berkeley Crossword Solver: symbolic reasoning, not a word-list lookup.
03 · Doubt & retract. When a crossing contradicts a placed answer, the engine un-places it and explains why. Every place and retract is recorded with the reasoning behind it.
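The three steps above can be sketched on a toy 2×2 grid. This is a minimal illustration, not the production engine: the `Solver` class, its method names, the slot labels, and the candidate lists are all hypothetical, and plain chronological backtracking stands in for conflict-directed retraction.

```python
from dataclasses import dataclass, field

@dataclass
class Solver:
    slots: dict        # slot name -> list of (row, col) cells
    candidates: dict   # slot name -> [(answer, probability)], stand-in for the proposer
    grid: dict = field(default_factory=dict)    # (row, col) -> letter
    placed: dict = field(default_factory=dict)  # slot name -> answer
    log: list = field(default_factory=list)     # (action, slot, answer, reason)

    def viable(self, slot):
        """Candidates that match the slot length and every letter already in the grid."""
        cells = self.slots[slot]
        ranked = sorted(self.candidates[slot], key=lambda wp: -wp[1])
        return [(w, p) for w, p in ranked
                if len(w) == len(cells)
                and all(self.grid.get(c, ch) == ch for c, ch in zip(cells, w))]

    def place(self, slot, word, reason):
        fresh = [c for c in self.slots[slot] if c not in self.grid]
        for cell, ch in zip(self.slots[slot], word):
            self.grid[cell] = ch
        self.placed[slot] = word
        self.log.append(("place", slot, word, reason))
        return fresh  # cells this placement filled, so a retraction clears only those

    def retract(self, slot, fresh, reason):
        word = self.placed.pop(slot)
        for cell in fresh:
            del self.grid[cell]
        self.log.append(("retract", slot, word, reason))

    def solve(self):
        open_slots = [s for s in self.slots if s not in self.placed]
        if not open_slots:
            return True
        # MRV: work on the slot with the fewest viable candidates.
        slot = min(open_slots, key=lambda s: len(self.viable(s)))
        for word, prob in self.viable(slot):
            fresh = self.place(slot, word, f"p={prob:.2f}, consistent with crossings")
            # Forward check: every other open slot must keep at least one candidate.
            dead = [s for s in open_slots if s != slot and not self.viable(s)]
            if not dead and self.solve():
                return True
            reason = (f"leaves no candidate for {dead[0]}" if dead
                      else "dead end deeper in the search")
            self.retract(slot, fresh, reason)
        return False

# Toy data (hypothetical): two across slots and two down slots on a 2x2 grid.
slots = {"1A": [(0, 0), (0, 1)], "2A": [(1, 0), (1, 1)],
         "1D": [(0, 0), (1, 0)], "2D": [(0, 1), (1, 1)]}
candidates = {"1A": [("AT", 0.6), ("AS", 0.4)], "2A": [("NO", 0.9), ("NE", 0.1)],
              "1D": [("AN", 0.8), ("ON", 0.2)], "2D": [("SO", 0.7), ("SE", 0.3)]}

solver = Solver(slots, candidates)
solved = solver.solve()
for entry in solver.log:
    print(entry)
```

On this toy grid the most probable answer for 1A (AT) is placed first and then retracted, because it leaves no candidate for 2D; the log keeps both moves along with the reason.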

Where we stand — honestly

Published numbers for the field, and ours next to them. We’ll update this as the benchmark moves.

System	What it does	Published result
Berkeley Crossword Solver (2021)	Neural QA → loopy belief propagation → local-search correction	82% full-puzzle, 99.9% letter (themeless). First program to beat every human at the ACPT. The bar.
Dr.Fill (Ginsberg)	Heuristic search over a clue database	Near top-human for years; hybridized with Berkeley for the 2021 win.
OneLook	Pattern + dictionary/definition lookup	Excellent reference search — but no reasoning, no grid, no explanation.
Crossword Genius (Ross)	Cryptic clue explanation	Strong at explaining cryptic wordplay; single-clue, no full-grid reasoning.
Across & Down (ours)	LLM + historical distribution propose → constraint engine places → records reasoning per move	Clue-answer recall: ~50% top-1, ~67% top-20 on held-out historical clues. Full 15×15 solve: below Berkeley today — we’re building the same belief-propagation path that closes the gap.

The honest version: on a fully filled hard grid we are not yet at Berkeley’s accuracy. We say so on purpose. What we have that none of them ship: a recorded, replayable chain of reasoning for every move.
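For readers unfamiliar with the top-1 and top-20 figures in the table, this is how top-k recall is conventionally computed. The function name and the toy predictions are illustrative assumptions, not our evaluation harness.

```python
def recall_at_k(ranked_predictions, gold_answers, k):
    """Fraction of clues whose true answer appears among the first k candidates."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_predictions, gold_answers))
    return hits / len(gold_answers)

# Hypothetical ranked candidates for four clues, best guess first.
predictions = [["OREO", "OLEO"], ["ERA", "EON", "AGE"],
               ["ALOE", "ALAS"], ["ETNA", "OSSA"]]
gold = ["OREO", "EON", "ARIA", "ETNA"]

top1 = recall_at_k(predictions, gold, k=1)
top2 = recall_at_k(predictions, gold, k=2)
print(top1, top2)
```

A clue counts as a hit at k if the gold answer appears anywhere in the first k candidates, which is why top-20 recall can never be lower than top-1.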

What only we do

Replayable reasoning. Every solve stores each placement and retraction with the why. You can scrub it move by move. Berkeley wins; it doesn’t show its work. We do.
Honest uncertainty. Cryptic mode breaks down the definition and wordplay and tells you plainly what it can’t parse — a calibrated “not sure” beats a confident fabrication.
Your puzzle, by photo. Snap the puzzle in your actual newspaper; it reads the grid and every clue and assists without spoiling.
Power pattern search. _ matches one letter, * matches any run of letters, and :meaning adds a semantic filter. It is built to out-reach OneLook on the queries that matter mid-solve.
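The letter-pattern part of that search can be sketched by translating a pattern into a regular expression. This is an assumption-laden sketch: it treats _ as exactly one letter and * as zero or more letters, the function names are hypothetical, and the :meaning semantic filter is left out because it needs a model rather than a regex.

```python
import re

def compile_pattern(pattern):
    """Translate a crossword pattern into a regex over uppercase letters."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append("[A-Z]")      # exactly one unknown letter
        elif ch == "*":
            parts.append("[A-Z]*")     # any run of letters (assumed: zero or more)
        else:
            parts.append(re.escape(ch.upper()))  # a known letter
    return re.compile("".join(parts))

def search(pattern, words):
    """Return the words that match the whole pattern, in their original order."""
    rx = compile_pattern(pattern)
    return [w for w in words if rx.fullmatch(w.upper())]

words = ["OREO", "ORZO", "ONTARIO", "OHIO", "AREA"]
print(search("O__O", words))
print(search("O*O", words))
print(search("_RE_", words))
```

Using `fullmatch` rather than `search` keeps the answer length fixed by the pattern, which is what a solver wants when the slot length is known.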

The roadmap to the bar

✓ Shipped. Distributional clue-answering (the historical index: 50× the top-1 recall of a length+frequency baseline).
→ Building. Belief-propagation grid solving over those distributions — the step that took Berkeley to 99.9% letter accuracy.
○ Next. Self-correction pass; a modern clue bank for contemporary answers; a public daily benchmark on real published puzzles.
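To make the "Building" item concrete, here is a generic sum-product message-passing sketch over per-slot answer distributions on a toy grid. It is neither Berkeley's implementation nor ours: the function names, the `EPS` smoothing constant, the number of rounds, and the toy priors are all assumptions chosen for illustration.

```python
from collections import defaultdict

EPS = 1e-6  # assumed smoothing: a crossing never rules a letter out completely

def normalise(weights):
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

def incoming(msgs, cell_slots, slot, cell, letter):
    """Support for `letter` at `cell` from every other slot that crosses it."""
    score = 1.0
    for other in cell_slots[cell]:
        if other != slot and (other, cell) in msgs:
            score *= msgs[(other, cell)].get(letter, EPS)
    return score

def word_weight(msgs, cell_slots, slot, cells, word, prior, skip=None):
    """Prior times crossing support, optionally ignoring one cell of the slot."""
    weight = prior
    for index, cell in enumerate(cells):
        if index != skip:
            weight *= incoming(msgs, cell_slots, slot, cell, word[index])
    return weight

def run_bp(slots, priors, rounds=5):
    cell_slots = defaultdict(list)
    for slot, cells in slots.items():
        for cell in cells:
            cell_slots[cell].append(slot)
    msgs = {}  # (slot, cell) -> {letter: probability}, what a slot tells one cell
    for _ in range(rounds):
        new_msgs = {}
        for slot, cells in slots.items():
            for index, target in enumerate(cells):
                out = defaultdict(float)
                for word, prior in priors[slot].items():
                    # Exclude the target cell so a slot does not echo back its own input.
                    out[word[index]] += word_weight(
                        msgs, cell_slots, slot, cells, word, prior, skip=index)
                new_msgs[(slot, target)] = normalise(out)
        msgs = new_msgs  # synchronous update: all messages advance together
    return {slot: normalise({word: word_weight(msgs, cell_slots, slot, cells, word, prior)
                             for word, prior in priors[slot].items()})
            for slot, cells in slots.items()}

# Toy 2x2 grid with hypothetical priors; every answer length matches its slot.
slots = {"1A": [(0, 0), (0, 1)], "2A": [(1, 0), (1, 1)],
         "1D": [(0, 0), (1, 0)], "2D": [(0, 1), (1, 1)]}
priors = {"1A": {"AT": 0.6, "AS": 0.4}, "2A": {"NO": 0.9, "NE": 0.1},
          "1D": {"AN": 0.8, "ON": 0.2}, "2D": {"SO": 0.7, "SE": 0.3}}

beliefs = run_bp(slots, priors)
best = {slot: max(dist, key=dist.get) for slot, dist in beliefs.items()}
print(best)
```

In this toy case the prior favours AT at 1A, but both candidates for 2D start with S, so the crossing evidence moves almost all of the belief to AS without any search.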

Methodology: recall measured on a puzzle-level held-out split of the public-domain NYT archive (≈450k clue→answer pairs). Berkeley figures from Wallace et al., ACL 2022. Brutal feedback welcome — roger@grubb.net.
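A puzzle-level split means every clue from a given puzzle lands on the same side of the train/held-out boundary, so a puzzle's crossings and theme cannot leak into training. One common way to get that is to hash the puzzle identifier. The sketch below assumes a string puzzle ID and a 10% held-out fraction; both are illustrative choices, not the exact procedure behind the numbers above.

```python
import hashlib

def split_of(puzzle_id, holdout_fraction=0.1):
    """Assign a whole puzzle to 'heldout' or 'train' from a stable hash of its ID."""
    digest = hashlib.sha256(puzzle_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "heldout" if bucket < holdout_fraction else "train"

# Hypothetical rows: (puzzle_id, clue, answer). Rows sharing an ID share a split.
rows = [("puzzle-0001", "Sandwich cookie", "OREO"),
        ("puzzle-0001", "Sicilian volcano", "ETNA"),
        ("puzzle-0002", "Long stretch of time", "EON")]

assigned = [(split_of(pid), clue, answer) for pid, clue, answer in rows]
print(assigned)
```

Because the assignment depends only on the ID, it is reproducible across runs and does not change when new puzzles are added to the archive.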