Architecture
Overview
SeaPortal is an HTTP-first content extraction tool designed for AI agents. It fetches web pages using standard HTTP requests (no browser) and extracts clean, structured content.
Pipeline
URL → HTTP Fetch → Bot Detection → Classification → Extraction → Cleanup → Output
Stages
- HTTP Fetch — Requests that resist TLS fingerprinting, sent via uTLS
- Bot Detection — Identifies Cloudflare, PerimeterX, Incapsula challenges
- Classification — Categorises pages: static, SSR, hydrated, spa, dynamic, blocked
- Extraction — Readability-based content extraction + HTML-to-Markdown
- Cleanup — Deduplication, preprocessing, quality scoring
- Output — Markdown text, JSON, or accessibility snapshot
Key Design Decisions
- No browser dependency — HTTP-only keeps it fast and lightweight
- uTLS for stealth — Mimics real browser TLS fingerprints
- Classification-first — Knowing the page type guides extraction strategy
- Quality scoring — Every extraction gets a confidence score
- Accessibility snapshots — Structured tree output for AI consumption
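To illustrate the classification-first and quality-scoring decisions together, here is a hedged sketch of how a page class might select an extraction plan and cap the confidence score. The class names come from the stage list; the `Plan` type, the `PlanFor` function, and the numeric caps are assumptions made for illustration and are not taken from SeaPortal.

```go
package main

import "fmt"

// PageClass mirrors the categories named in the stage list.
type PageClass string

const (
	ClassStatic   PageClass = "static"
	ClassSSR      PageClass = "SSR"
	ClassHydrated PageClass = "hydrated"
	ClassSPA      PageClass = "spa"
	ClassDynamic  PageClass = "dynamic"
	ClassBlocked  PageClass = "blocked"
)

// Plan is a hypothetical extraction plan derived from the page class.
type Plan struct {
	Extract       bool    // whether extraction is attempted at all
	MaxConfidence float64 // illustrative upper bound on the score, 0..1
}

// PlanFor maps a page class to a plan. The caps are made-up values that
// only encode an ordering: server-rendered HTML is trusted more than
// pages that rely on client-side rendering.
func PlanFor(c PageClass) Plan {
	switch c {
	case ClassStatic, ClassSSR:
		return Plan{Extract: true, MaxConfidence: 1.0}
	case ClassHydrated:
		return Plan{Extract: true, MaxConfidence: 0.8}
	case ClassSPA, ClassDynamic:
		return Plan{Extract: true, MaxConfidence: 0.4}
	default:
		// Blocked or unknown pages: skip extraction entirely.
		return Plan{}
	}
}

func main() {
	for _, c := range []PageClass{ClassStatic, ClassSPA, ClassBlocked} {
		fmt.Printf("%-8s %+v\n", c, PlanFor(c))
	}
}
```

Deciding the plan before extraction means an HTTP-only tool can report low confidence for client-rendered pages instead of returning a near-empty result as if it were complete.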
Directory Structure
seaportal.go          Public API (package seaportal; re-exports internal/engine)
cmd/seaportal/        CLI entry point (extract, sitemap, feed, mcp)
cmd/seabench/         Benchmark / evaluation harness
internal/engine/      Core extraction, classification, quality scoring, snapshots
internal/mcp/         MCP server (JSON-RPC over stdio)
internal/testserver/  Hermetic test fixtures
testdata/             HTML fixtures for testing
tests/e2e/            Docker-based end-to-end tests