Classifier

./dev bench classify reports 40/40 on tests/eval/corpus.yaml. That number is in-corpus: the classifier rules were tuned against those same 40 fixtures, so it measures fit, not generalization. This page records an independent held-out check on real sites that appear in neither the corpus nor tests/optimization/sites.tsv, to see how the classifier does on data it was never shaped by.

Reproduce (live, point-in-time):

./dev bench sweep --sites tests/optimization/holdout.tsv --timeout 25s

Ground-truth labels were assigned by inspecting the actual HTTP fetch: main content present → static/ssr/hydrated; absent / JS-only → spa/dynamic; bot wall → blocked. App sites are labelled by their server-rendered marketing homepage, not the in-app SPA.

Result (snapshot 2026-05-31T07:46:41Z, git bdacbf0, 31 labelled sites)

| Metric | In-corpus (tuned) | Held-out (this set) |
| --- | --- | --- |
| 6-way pageClass exact-match accuracy | 1.00 (40/40) | 0.71 (22/31) |
| Browser-routing decision (browserRecommended) | - | 1.00 (31/31) |

The 6-way number drops on held-out data, but every one of the 9 class disagreements is routing-equivalent: it does not change the extract-vs-browser decision.

| Site | Hand label | Classifier class | Routing impact |
| --- | --- | --- | --- |
| notion.so, canva.com, cnbc.com, theverge.com, arstechnica.com, walmart.com | ssr | hydrated | none (both extract) |
| nasa.gov | ssr | static | none (both extract) |
| imdb.com, espn.com | blocked (HTTP 202 wall) | spa | none (both needs-browser) |

ssr/static/hydrated all route to extract; spa/dynamic/blocked all route to needs-browser. Measured at the decision level (the signal the product actually routes on), the classifier flagged 7/7 needs-browser sites with zero false positives and zero false negatives.
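The routing-equivalence check can be sketched in a few lines of Python. The class-to-route grouping and the nine disagreement rows are taken from this page; the names ROUTE, DISAGREEMENTS, and routing_equivalent are illustrative and not identifiers from the codebase, and the 22 exact matches are carried as a count rather than listed per site.

```python
# Routing groups as stated on this page. ROUTE is an illustrative name,
# not an identifier from the classifier's codebase.
ROUTE = {
    "static": "extract",
    "ssr": "extract",
    "hydrated": "extract",
    "spa": "needs-browser",
    "dynamic": "needs-browser",
    "blocked": "needs-browser",
}

# The 9 held-out disagreements: (site, hand label, classifier class).
DISAGREEMENTS = [
    ("notion.so", "ssr", "hydrated"),
    ("canva.com", "ssr", "hydrated"),
    ("cnbc.com", "ssr", "hydrated"),
    ("theverge.com", "ssr", "hydrated"),
    ("arstechnica.com", "ssr", "hydrated"),
    ("walmart.com", "ssr", "hydrated"),
    ("nasa.gov", "ssr", "static"),
    ("imdb.com", "blocked", "spa"),
    ("espn.com", "blocked", "spa"),
]

EXACT_MATCHES = 22  # sites where hand label and classifier class agree
TOTAL = 31          # labelled held-out sites


def routing_equivalent(hand: str, predicted: str) -> bool:
    """True when both classes lead to the same extract-vs-browser route."""
    return ROUTE[hand] == ROUTE[predicted]


# Disagreements that would actually flip the routing decision.
routing_changes = [
    site for site, hand, pred in DISAGREEMENTS
    if not routing_equivalent(hand, pred)
]

exact_accuracy = EXACT_MATCHES / TOTAL
routing_accuracy = (TOTAL - len(routing_changes)) / TOTAL

print(f"6-way exact match: {exact_accuracy:.2f}")
print(f"routing decision:  {routing_accuracy:.2f}")
print(f"disagreements that change routing: {routing_changes}")
```

Collapsing six classes into two routes is what turns nine class-level misses into zero decision-level misses.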

Takeaway

This is the empirical backing for the routing guidance in api.md / browser-discriminator.md: the browser-routing decision generalizes (held-out 31/31), while the raw 6-way pageClass (like the quality float) is a softer signal that blurs on the ssr/static/hydrated boundary. Consumers should route on profile.decision / profile.browserRecommended, not on the raw class or quality.
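A minimal consumer-side sketch of that guidance, in Python. The profile is modelled here as a plain dict; only browserRecommended and pageClass are names from this page, while the dict shape, the choose_fetcher helper, and its return strings are assumptions made for illustration.

```python
from typing import Any, Mapping


def choose_fetcher(profile: Mapping[str, Any]) -> str:
    """Return "browser" or "extract" based only on the routing signal.

    Hypothetical helper: the real profile object and its field types
    may differ from this dict-based stand-in.
    """
    # Route on the decision-level signal and ignore pageClass entirely.
    if profile["browserRecommended"]:
        return "browser"
    return "extract"


# Stand-in profiles: the first two differ in pageClass but route the same,
# mirroring the ssr-vs-hydrated blur described above.
ssr_like = {"pageClass": "ssr", "browserRecommended": False}
hydrated_like = {"pageClass": "hydrated", "browserRecommended": False}
spa_like = {"pageClass": "spa", "browserRecommended": True}

for p in (ssr_like, hydrated_like, spa_like):
    print(p["pageClass"], "->", choose_fetcher(p))
```

Because the helper never reads pageClass, a disagreement inside the ssr/static/hydrated group cannot change what the consumer does.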

Caveats: the sample is small (31 sites); the run is live and point-in-time (sites change, and a couple of labels, the imdb/espn bot interstitials, are themselves fetch-dependent); and two sites (target.com, nginx.com) were left as any (unresolved-thin). The set is a generalization probe, not a fixed regression gate; re-run the sweep to refresh.
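To put a rough number on the small-sample caveat, the sketch below computes a Wilson score interval for the counts reported on this page. This is an added illustration, not part of the benchmark: it treats the sites as independent draws, which is itself an assumption, and wilson_interval is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Counts as reported on this page.
for label, k, n in [
    ("routing decision, all sites", 31, 31),
    ("needs-browser sites flagged", 7, 7),
    ("6-way exact match", 22, 31),
]:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n}, ~95% interval [{low:.2f}, {high:.2f}]")
```

Under this approximation a perfect 31/31 still leaves a lower bound of about 0.89, and 7/7 a lower bound of about 0.65, which is consistent with reading the set as a probe rather than a guarantee.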