Turn static and server-rendered pages into clean Markdown or a semantic accessibility tree your agent can act on. HTTP-first and browserless: it is fast on the pages that allow it, and it tells you when one actually needs a browser. Use it as a CLI, a Go library, or an MCP server.
$ npm install -g seaportal && seaportal https://example.com
It does not execute JavaScript, render pages, or manage sessions, by design. When a page needs any of that, SeaPortal tells you instead of guessing.
> Quick Start
Install the binary, point it at a URL, get clean content back. No server to host, no browser to drive.
    # Install via npm
    npm install -g seaportal
> Why SeaPortal
Your agent can already pull a URL with a plain fetch (the web_fetch tool, curl, an HTTP client). The difference is what comes back: a raw page dump versus structured content it can reason about and act on.
Already wired for agents: run seaportal mcp and the same engine shows up as MCP tools.
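As a rough sketch of that wiring: a number of MCP clients register stdio servers through a JSON file with a top-level mcpServers map. Assuming yours does, an entry along these lines should start SeaPortal as a tool server. The entry name "seaportal" and the location of the config file are assumptions, so check your client's documentation before relying on it.

```shell
# Sketch: print a stdio MCP server entry for SeaPortal.
# Assumption: the client reads a JSON config with a top-level "mcpServers" map.
# The entry name "seaportal" is arbitrary; only the command and args matter.
mcp_config() {
  cat <<'JSON'
{
  "mcpServers": {
    "seaportal": {
      "command": "seaportal",
      "args": ["mcp"]
    }
  }
}
JSON
}

# Print the fragment so it can be pasted or redirected into the client's config.
mcp_config
```

The only parts taken from this page are the command (seaportal) and its subcommand (mcp); everything else is client-specific.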
> What It Does
Fast, structured extraction you can drop straight into an agent loop.
Readable Markdown with nav, ads, and chrome removed, plus block and simhash near-duplicate dedupe.
A semantic tree of the page where every node has a role, ref, and selector. Filter to interactive elements only.
No headless overhead. On reachable static / SSR pages, p50 extraction time is around 1 s (50-site in-niche sweep; see Reliability).
A fail-over triage, not a universal fetcher: it emits a browser-recommendation signal so you can route hard pages elsewhere.
Flatten sitemap.xml (recurses indexes, auto-decompresses .gz) and parse RSS, Atom, and JSON Feed.
Run seaportal mcp to expose fetch_url, fetch_snapshot, parse_sitemap, and parse_feed as agent tools over stdio.
SSRF / private-IP blocking, http(s)-only, and redirect and response-body caps applied out of the box.
One Go binary, or import github.com/pinchtab/seaportal and call FromURL directly. Same engine everywhere.
Latency figures come from committed live sweeps: a 50-site in-niche sample and the Tranco top-1000. See the Reliability snapshot for the full p50 / p90 / p95 breakdown and method.
The core stays a fast, no-browser fetch-and-extract primitive. These build on top of it; reach for them only when a pipeline calls for them.
> Use Cases
A fast, no-browser fetch-and-extract primitive for agent and ingestion pipelines.
Hand an agent a URL and get clean content back fast, plus a clear signal when the page needs a real browser, so the agent can route instead of stalling.
Turn documentation, articles, and knowledge bases into clean Markdown with dedupe and BM25 section scoring, ready to chunk and embed.
Flatten a sitemap.xml (recursing indexes) or parse an RSS / Atom / JSON feed to enumerate what to fetch next, in one command.
Point it at URLs you do not control with SSRF / private-IP blocking, http(s)-only, and redirect and body caps on by default.
Use SeaPortal as the cheap first hop and branch on profile.decision / browserRecommended to escalate only the hard pages to a browser like PinchTab.
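For the ingestion and crawl use cases above, a minimal sketch of a batch loop might look like the following. It reads URLs from stdin rather than from `seaportal sitemap` directly, because the sitemap command's output format is not described on this page. The helper names url_slug and ingest_urls are hypothetical, and the sketch assumes seaportal writes Markdown to stdout and exits non-zero on failure.

```shell
# url_slug URL: derive a filesystem-safe name from a URL (hypothetical helper).
url_slug() {
  printf '%s' "$1" | sed -e 's#^[A-Za-z]*://##' -e 's#[^A-Za-z0-9._-]#-#g' -e 's#-*$##'
}

# ingest_urls OUT_DIR: read URLs from stdin, one per line, and save each page
# as OUT_DIR/<slug>.md.
# Assumptions: seaportal writes Markdown to stdout and exits non-zero on failure.
ingest_urls() {
  local out_dir="$1" url target
  mkdir -p "$out_dir" || return 1
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue
    target="$out_dir/$(url_slug "$url").md"
    # </dev/null keeps seaportal from consuming the URL list on stdin.
    if ! seaportal "$url" </dev/null >"$target"; then
      rm -f "$target"
      echo "skipped: $url" >&2
    fi
  done
}

# Usage: ingest_urls ./corpus < urls.txt
```

Failed fetches are reported on stderr and leave no file behind, so a later chunk-and-embed step only sees pages that extracted.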
> Integration
Use SeaPortal as the cheap first hop, then escalate only the hard pages. Every fetch carries a routing decision so the agent never has to guess.
    # Fetch once as JSON, then route on the profile
    result=$(seaportal --json "$url")
    decision=$(echo "$result" | jq -r '.profile.decision')

    # profile.browserRecommended is the single boolean to branch on;
    # profile.decision is the detailed category behind it.
    case "$decision" in
      static-high-confidence|static-ok)
        # Clean extraction: hand result.content straight to the agent
        ;;
      browser-needed|blocked)
        # Escalate to a real browser such as PinchTab
        pinchtab open "$url"
        ;;
      *)
        # not-found | unreachable | unsupported: skip or report
        ;;
    esac
See the browser discriminator reference for every profile.decision value and exactly when each is emitted.
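To apply the same routing to a whole list of URLs, one option is to wrap the decision mapping in a function and emit one tab-separated line per URL. This is only a sketch: the function names and the queue labels (extract, browser, skip) are hypothetical, and it assumes seaportal and jq are on PATH and that the decision lives at .profile.decision in the JSON output, as in the routing snippet in this section.

```shell
# route_for_decision DECISION: map a profile.decision value to a queue label.
# The labels extract / browser / skip are illustrative, not SeaPortal output.
route_for_decision() {
  case "$1" in
    static-high-confidence|static-ok) echo extract ;;
    browser-needed|blocked)           echo browser ;;
    *)                                echo skip ;;
  esac
}

# route_urls: read URLs from stdin, one per line, and print "<queue><TAB><url>".
# Requires seaportal and jq on PATH.
route_urls() {
  local url decision
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue
    # </dev/null keeps seaportal from consuming the URL list on stdin.
    decision=$(seaportal --json "$url" </dev/null | jq -r '.profile.decision')
    printf '%s\t%s\n' "$(route_for_decision "$decision")" "$url"
  done
}

# Usage: route_urls < urls.txt > routed.tsv
```

Any value the mapping does not recognise, including an empty result from a failed fetch, falls into skip, so an unexpected decision never reaches the browser queue by accident.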
> Architecture
One binary, one extraction engine. There is no service to host: drive it from the command line, embed the Go library, or run it as an MCP server.

    # Markdown (default), JSON, or an accessibility tree
    seaportal https://example.com
    seaportal --json https://example.com
    seaportal --snapshot https://example.com
> Commands
Extraction is the default verb; sitemap, feed, and mcp are subcommands. The full flag list lives in the CLI reference.
| Command | Description |
|---|---|
| `seaportal <url>` | Extract a page as Markdown (the default verb) |
| `seaportal --json <url>` | Structured JSON output instead of Markdown |
| `seaportal --snapshot <url>` | Accessibility tree with element refs and selectors |
| `seaportal --fast <url>` | Bail early if the page needs a browser |
| `seaportal --query "..." <url>` | Rank sections by BM25 relevance (pair with --top-n) |
| `seaportal sitemap <url>` | Flatten a sitemap.xml (recurses indexes, .gz aware) |
| `seaportal feed <url>` | Parse RSS / Atom / JSON Feed into unified entries |
| `seaportal mcp` | Run as an MCP server over stdio |
See the CLI reference for every flag: output formats, dedupe, caching, retries, link/image/table extraction, and the security policy.
> Plain Language
A quick decoder for the jargon used across this page and the docs.
SeaPortal fetches a page over plain HTTP and never launches a browser. That makes it fast and cheap on static and server-rendered pages, and means it does not run JavaScript.
A JSON tree of the page where every element has a role, a stable ref (e1, e2, …), and a CSS selector, so an agent can point at a specific button or field instead of parsing raw text.
On each fetch SeaPortal classifies the page and emits profile.decision and browserRecommended. When a page truly needs rendering, it tells you so you can route it to a browser.
Model Context Protocol: run "seaportal mcp" to expose fetch_url, fetch_snapshot, parse_sitemap, and parse_feed as tools an MCP-aware agent can call over stdio.
> Security
SeaPortal runs locally, so there is no server to lock down. The fetch path itself ships hardened, and you loosen it only for targets you trust.
The CLI and MCP server apply DefaultSecurityPolicy() on every run (SSRF and private-IP blocking, scheme restrictions, and redirect and body caps), so an agent pointing it at a hostile URL can't be turned against your network. Go library callers opt in for untrusted URLs by setting Options.Security.
--block-private-ips: On by default. Targets resolving to private or internal IPs are rejected, an SSRF guard for agents fed untrusted URLs.
http / https only: Other schemes are refused before a request is ever made.
--max-redirects 10: Redirects are capped and each hop is re-validated against the policy.
--max-response-bytes: Raw body capped at 50 MB; decompressed body at 200 MB, which defuses decompression bombs.
--allow-domains / --deny-domains: Restrict reachable hosts with suffix-match allow/deny lists. Deny always wins.
--allow-internal: Explicit escape hatch to reach localhost or a private host, off unless you ask for it.
See SECURITY.md for the full threat model and coverage caveats (e.g. --proxy and --snapshot have narrower checks).
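To make the "suffix-match, deny always wins" rule for --allow-domains / --deny-domains concrete, here is a small illustration of that logic in shell. It is not SeaPortal's implementation, only a sketch of the semantics as described on this page. The function names are hypothetical, and treating an empty allow list as "no allow restriction" is an assumption.

```shell
# host_matches_suffix HOST SUFFIX: true when HOST equals SUFFIX or is a subdomain of it.
host_matches_suffix() {
  case "$1" in
    "$2" | *".$2") return 0 ;;
    *) return 1 ;;
  esac
}

# host_permitted HOST "ALLOW..." "DENY...": lists are space-separated lowercase suffixes.
# Assumption: an empty allow list means every host that is not denied is permitted.
host_permitted() {
  local host allow="$2" deny="$3" entry
  host=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
  # Deny is checked first, so it wins even when an allow entry also matches.
  for entry in $deny; do
    if host_matches_suffix "$host" "$entry"; then return 1; fi
  done
  [ -n "$allow" ] || return 0
  for entry in $allow; do
    if host_matches_suffix "$host" "$entry"; then return 0; fi
  done
  return 1
}

# Demo: allow example.com and its subdomains, but deny ads.example.com.
for h in docs.example.com ads.example.com example.org; do
  if host_permitted "$h" "example.com" "ads.example.com"; then
    echo "$h allowed"
  else
    echo "$h blocked"
  fi
done
```

Run as written, the demo reports docs.example.com as allowed and both ads.example.com and example.org as blocked. Matching on a leading dot is what keeps a host like notexample.com from passing as a subdomain of example.com.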