Architecture

Name: SeaPortal
Author: Luigi Agosti

Overview

SeaPortal is an HTTP-first content extraction tool designed for AI agents. It fetches web pages with plain HTTP requests, without launching a browser, and extracts clean, structured content from the response.

Pipeline

URL → HTTP Fetch → Bot Detection → Classification → Extraction → Cleanup → Output
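
One way to picture the pipeline is as an ordered list of stage functions that share a single page value, where any stage can stop the run by returning an error. The sketch below is illustrative only: `Page`, `Stage`, `Run`, `ErrBlocked`, the stub stages, and the `challenge-placeholder` marker are hypothetical names chosen for this example and are not taken from the SeaPortal codebase. The fetch stage returns a canned document instead of making a network call.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Page carries state from one stage to the next.
// Every name in this sketch is hypothetical, not SeaPortal's API.
type Page struct {
	URL     string
	HTML    string
	Class   string
	Content string
}

// Stage is one pipeline step. Returning an error stops the run.
type Stage func(*Page) error

// ErrBlocked signals that bot protection was detected.
var ErrBlocked = errors.New("blocked by bot protection")

// Run applies the stages in order and stops at the first error.
func Run(p *Page, stages ...Stage) error {
	for _, stage := range stages {
		if err := stage(p); err != nil {
			return err
		}
	}
	return nil
}

// fetch stands in for the HTTP fetch; a canned response replaces the network call.
func fetch(p *Page) error {
	p.HTML = "<html><body><p> Hello </p></body></html>"
	return nil
}

// detectBot aborts the run when a (made-up) challenge marker is present.
func detectBot(p *Page) error {
	if strings.Contains(p.HTML, "challenge-placeholder") {
		return ErrBlocked
	}
	return nil
}

// classify always answers "static" in this sketch.
func classify(p *Page) error {
	p.Class = "static"
	return nil
}

// extract takes the text of the first <p> element.
func extract(p *Page) error {
	_, rest, ok := strings.Cut(p.HTML, "<p>")
	if !ok {
		return errors.New("no paragraph found")
	}
	p.Content, _, _ = strings.Cut(rest, "</p>")
	return nil
}

// cleanup trims surrounding whitespace from the extracted text.
func cleanup(p *Page) error {
	p.Content = strings.TrimSpace(p.Content)
	return nil
}

func main() {
	p := &Page{URL: "https://example.com"}
	if err := Run(p, fetch, detectBot, classify, extract, cleanup); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %s\n", p.Class, p.Content)
}
```

Because bot detection runs before classification and extraction, a blocked page in this sketch never reaches the later stages.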

Stages

HTTP Fetch — Requests sent via uTLS so the TLS handshake carries a browser-like fingerprint
Bot Detection — Identifies Cloudflare, PerimeterX, Incapsula challenges
Classification — Categorises pages as static, SSR, hydrated, SPA, dynamic, or blocked
Extraction — Readability-based content extraction + HTML-to-Markdown
Cleanup — Deduplication, preprocessing, quality scoring
Output — Markdown text, JSON, or accessibility snapshot
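
As an illustration of the Bot Detection stage, a challenge page can often be recognised by vendor-specific strings in the response body. The sketch below matches a body against a small marker table. The function name `DetectChallenge`, the `vendorRule` type, and the marker strings are assumptions made for this example. They reflect strings commonly seen on such pages, not the rules SeaPortal actually ships, and a real detector would likely also inspect status codes, headers, and cookies.

```go
package main

import (
	"fmt"
	"strings"
)

// vendorRule pairs a vendor name with body substrings that suggest one of
// its challenge pages. The markers are illustrative assumptions.
type vendorRule struct {
	Vendor  string
	Markers []string
}

var rules = []vendorRule{
	{Vendor: "cloudflare", Markers: []string{"Just a moment...", "cf-chl-"}},
	{Vendor: "perimeterx", Markers: []string{"px-captcha"}},
	{Vendor: "incapsula", Markers: []string{"_Incapsula_Resource"}},
}

// DetectChallenge returns the first vendor whose marker appears in body.
// The boolean is false when no marker matches.
func DetectChallenge(body string) (string, bool) {
	for _, rule := range rules {
		for _, marker := range rule.Markers {
			if strings.Contains(body, marker) {
				return rule.Vendor, true
			}
		}
	}
	return "", false
}

func main() {
	vendor, blocked := DetectChallenge("<title>Just a moment...</title>")
	fmt.Println(vendor, blocked)
}
```

Substring matching is cheap enough to run on every response, but it can produce false positives on pages that merely mention these strings, which is one reason to treat the result as a signal for classification and not as a final verdict.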

Key Design Decisions

No browser dependency — HTTP-only keeps it fast and lightweight
uTLS for stealth — Mimics real browser TLS fingerprints
Classification-first — Knowing the page type guides extraction strategy
Quality scoring — Every extraction gets a confidence score
Accessibility snapshots — Structured tree output for AI consumption
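
To make the classification-first and quality-scoring decisions concrete, the sketch below maps each page class to an extraction plan. The class names come from the Classification stage. The `Plan` type, the `PlanFor` function, and the score ceilings are hypothetical values invented for this example, not SeaPortal's actual behaviour. The idea shown is only that the class is known before extraction starts, so it can both select a strategy and bound how much confidence the result deserves.

```go
package main

import "fmt"

// PageClass mirrors the categories listed under Classification.
type PageClass string

const (
	Static   PageClass = "static"
	SSR      PageClass = "ssr"
	Hydrated PageClass = "hydrated"
	SPA      PageClass = "spa"
	Dynamic  PageClass = "dynamic"
	Blocked  PageClass = "blocked"
)

// Plan is a hypothetical extraction plan derived from the page class.
type Plan struct {
	Extract  bool    // whether content extraction should run at all
	MaxScore float64 // assumed ceiling on the confidence score
}

// PlanFor picks a plan for a class. The ceilings are made-up numbers that
// encode one assumption: the less content arrives in the initial HTML,
// the less an HTTP-only extraction can be trusted.
func PlanFor(c PageClass) Plan {
	switch c {
	case Static, SSR:
		return Plan{Extract: true, MaxScore: 1.0}
	case Hydrated:
		return Plan{Extract: true, MaxScore: 0.8}
	case SPA, Dynamic:
		return Plan{Extract: true, MaxScore: 0.4}
	default: // Blocked or unknown: nothing useful to extract.
		return Plan{Extract: false, MaxScore: 0}
	}
}

func main() {
	for _, c := range []PageClass{Static, Hydrated, SPA, Blocked} {
		fmt.Printf("%-8s %+v\n", c, PlanFor(c))
	}
}
```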

Directory Structure

seaportal.go       Public API (package seaportal — re-exports internal/engine)
cmd/seaportal/     CLI entry point (extract, sitemap, feed, mcp)
cmd/seabench/      Benchmark / evaluation harness
internal/engine/   Core extraction, classification, quality scoring, snapshots
internal/mcp/      MCP server (JSON-RPC over stdio)
internal/testserver/  Hermetic test fixtures
testdata/          HTML fixtures for testing
tests/e2e/         Docker-based end-to-end tests
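
The internal/mcp/ entry describes a JSON-RPC server that talks over stdio. A minimal sketch of that transport pattern, assuming one JSON-RPC 2.0 message per line, is shown below. It is not the code in internal/mcp/: the type names and `HandleLine` are hypothetical, and `ping` is used only as a stand-in method. The error codes -32700 (parse error) and -32601 (method not found) are the ones defined by JSON-RPC 2.0.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// request is the subset of a JSON-RPC 2.0 request used in this sketch.
type request struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Method  string          `json:"method"`
}

// rpcError is a JSON-RPC 2.0 error object.
type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// response carries either a result or an error, never both.
type response struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  any             `json:"result,omitempty"`
	Error   *rpcError       `json:"error,omitempty"`
}

// HandleLine decodes one request line and returns the encoded response.
func HandleLine(line []byte) []byte {
	var req request
	if err := json.Unmarshal(line, &req); err != nil {
		return mustMarshal(response{
			JSONRPC: "2.0",
			ID:      json.RawMessage("null"),
			Error:   &rpcError{Code: -32700, Message: "parse error"},
		})
	}
	resp := response{JSONRPC: "2.0", ID: req.ID}
	switch req.Method {
	case "ping": // stand-in method with an empty result object
		resp.Result = struct{}{}
	default:
		resp.Error = &rpcError{Code: -32601, Message: "method not found"}
	}
	return mustMarshal(resp)
}

// mustMarshal encodes v and panics on failure, which cannot happen for
// the fixed types used here.
func mustMarshal(v any) []byte {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// Read one request per line from stdin, write one response per line.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		fmt.Println(string(HandleLine(sc.Bytes())))
	}
}
```

For brevity the sketch answers every message, including ones without an id. A complete server would treat those as notifications and send no response.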