SeaPortal

Fast content extraction for AI agents

Turn static and server-rendered pages into clean Markdown or a semantic accessibility tree your agent can act on. HTTP-first and browserless — fast on the pages that allow it, and it tells you when one actually needs a browser. Use it as a CLI, a Go library, or an MCP server.

Go · No browser · MCP server · MIT licensed

 $ npm install -g seaportal && seaportal https://example.com

It does not execute JavaScript, render pages, or manage sessions — by design. When a page needs any of that, SeaPortal tells you instead of guessing.

> Quick Start

Up and running in seconds

Install the binary, point it at a URL, get clean content back. No server to host, no browser to drive.

terminal

# Install via npm
npm install -g seaportal

# Install via Go
go install github.com/pinchtab/seaportal/cmd/seaportal@latest

# Extract a page as Markdown
seaportal https://example.com
# ---
# title: "Example Domain"
# url: https://example.com
# confidence: 92
# isSpa: false
# pageClass: static
# needsBrowser: false
# ---
#
# # Example Domain
# This domain is for use in illustrative examples...

# Same fetch, machine-readable: branch on the profile
seaportal --json https://example.com
# {
#   "title": "Example Domain",
#   "confidence": 92,
#   "isSpa": false,
#   "pageClass": "static",
#   "profile": { "decision": "static-ok", "browserRecommended": false }
# }
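For callers that consume the JSON from Go rather than jq, the sample output above can be decoded with the standard library alone. This is a minimal sketch: the struct covers only the fields visible in the sample, the type names Result and Profile are illustrative, and treating confidence as a float is an assumption. The real payload may carry more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Profile mirrors the "profile" object in the sample output.
type Profile struct {
	Decision           string `json:"decision"`
	BrowserRecommended bool   `json:"browserRecommended"`
}

// Result holds only the fields shown in the sample; names are illustrative.
type Result struct {
	Title      string  `json:"title"`
	Confidence float64 `json:"confidence"` // assumed numeric; the sample shows 92
	IsSpa      bool    `json:"isSpa"`
	PageClass  string  `json:"pageClass"`
	Profile    Profile `json:"profile"`
}

// parseResult decodes the text printed by `seaportal --json <url>`.
func parseResult(raw []byte) (Result, error) {
	var r Result
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	sample := []byte(`{
	  "title": "Example Domain",
	  "confidence": 92,
	  "isSpa": false,
	  "pageClass": "static",
	  "profile": { "decision": "static-ok", "browserRecommended": false }
	}`)

	r, err := parseResult(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	if r.Profile.BrowserRecommended {
		fmt.Println("route to a browser:", r.Title)
	} else {
		fmt.Println("use extracted content:", r.Title)
	}
}
```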

> Why SeaPortal

More than a fetch

Your agent can already pull a URL with a plain fetch (the web_fetch tool, curl, an HTTP client). The difference is what comes back — a raw page dump versus structured content it can reason about and act on.

Output

Plain fetch: The whole page — nav, ads, scripts, boilerplate.

SeaPortal: Readable Markdown with the noise stripped, plus block and near-duplicate (simhash) dedupe.

Acting on the page

Plain fetch: Text only. No way to point at a button or field.

SeaPortal: A semantic accessibility tree — every interactive element carries a ref (e1, e2…) and CSS selector.

JavaScript-heavy sites

Plain fetch: Silently returns an empty shell.

SeaPortal: HTTP-first triage that flags when a page genuinely needs a browser. With --fast, it bails early instead of guessing.

Relevance & size

Plain fetch: You get everything, every time.

SeaPortal: BM25 --query / --top-n scoring and a --max-tokens cap return only the sections that matter.

Beyond one URL

Plain fetch: One page per call.

SeaPortal: Flatten a sitemap.xml or parse an RSS / Atom / JSON feed in a single command.

Safety

Plain fetch: Happily resolves internal IPs and odd schemes.

SeaPortal: SSRF / private-IP blocking and http(s)-only by default.

Already wired for agents: run seaportal mcp and the same engine shows up as MCP tools.

> What It Does

Built for agents, not browsers

Fast, structured extraction you can drop straight into an agent loop.

Clean extraction

Readable Markdown with nav, ads, and chrome removed — plus block and simhash near-duplicate dedupe.
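For readers unfamiliar with simhash, the idea behind near-duplicate detection can be sketched generically. This is not SeaPortal's implementation, which this page does not show. Whitespace tokenisation, FNV-1a hashing, and the function names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash builds a 64-bit fingerprint: every token votes +1 or -1 on each bit,
// and the sign of the running total decides the final bit.
func simhash(text string) uint64 {
	var votes [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// hamming counts the bits that differ between two fingerprints.
// Blocks whose fingerprints differ in only a few bits are near-duplicates.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	a := "Subscribe to our newsletter for weekly product updates and news"
	b := "Subscribe to our newsletter for weekly product updates and tips"
	c := "Flatten a sitemap or parse an RSS feed in a single command"

	fmt.Println("a vs b:", hamming(simhash(a), simhash(b)))
	fmt.Println("a vs c:", hamming(simhash(a), simhash(c)))
}
```

A dedupe pass would typically drop a block when its distance to an earlier block falls under a small threshold; the right threshold depends on block length and is left to the caller.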

Selector-free scraping

Point it at any URL and get clean, structured output — no CSS selectors or brittle scraper scripts to write or maintain.

Accessibility snapshot

A semantic tree of the page where every node has a role, ref, and selector. Filter to interactive elements only.

HTTP-first, no browser

No headless overhead. On reachable static / SSR pages, median (p50) extraction time is about 1 s, measured on a 50-site in-niche sweep (see Reliability).

Knows its limits

A fail-over triage, not a universal fetcher: it emits a browser-recommendation signal so you can route hard pages elsewhere.

Sitemaps & feeds

Flatten sitemap.xml (recurses indexes, auto-decompresses .gz) and parse RSS, Atom, and JSON Feed.
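To make "flatten" and "recurses indexes" concrete, here is a minimal sketch of sitemap flattening using the standard sitemap XML format. It is not SeaPortal's code: the names flatten, fetchFunc, and mapFetcher are hypothetical, fetching is injected so the example runs without network access, and .gz decompression is omitted.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// sitemapDoc accepts either a <urlset> (pages) or a <sitemapindex> (child sitemaps).
type sitemapDoc struct {
	XMLName  xml.Name
	URLs     []string `xml:"url>loc"`
	Sitemaps []string `xml:"sitemap>loc"`
}

// fetchFunc returns the body of a sitemap URL (hypothetical injection point).
type fetchFunc func(url string) ([]byte, error)

// mapFetcher serves sitemap bodies from memory, for demonstration only.
func mapFetcher(bodies map[string]string) fetchFunc {
	return func(url string) ([]byte, error) {
		body, ok := bodies[url]
		if !ok {
			return nil, fmt.Errorf("no sitemap at %s", url)
		}
		return []byte(body), nil
	}
}

// flatten returns every page URL reachable from url, following index files
// up to depth levels deep so that a self-referencing index cannot loop forever.
func flatten(url string, fetch fetchFunc, depth int) ([]string, error) {
	if depth < 0 {
		return nil, fmt.Errorf("sitemap nesting too deep at %s", url)
	}
	body, err := fetch(url)
	if err != nil {
		return nil, err
	}
	var doc sitemapDoc
	if err := xml.Unmarshal(body, &doc); err != nil {
		return nil, err
	}
	var pages []string
	for _, loc := range doc.URLs {
		pages = append(pages, strings.TrimSpace(loc))
	}
	for _, child := range doc.Sitemaps {
		sub, err := flatten(strings.TrimSpace(child), fetch, depth-1)
		if err != nil {
			return nil, err
		}
		pages = append(pages, sub...)
	}
	return pages, nil
}

func main() {
	fetch := mapFetcher(map[string]string{
		"https://example.com/sitemap.xml": `<sitemapindex>
			<sitemap><loc>https://example.com/docs.xml</loc></sitemap>
			<sitemap><loc>https://example.com/blog.xml</loc></sitemap>
		</sitemapindex>`,
		"https://example.com/docs.xml": `<urlset><url><loc>https://example.com/docs/intro</loc></url></urlset>`,
		"https://example.com/blog.xml": `<urlset><url><loc>https://example.com/blog/hello</loc></url></urlset>`,
	})
	pages, err := flatten("https://example.com/sitemap.xml", fetch, 3)
	if err != nil {
		fmt.Println("flatten failed:", err)
		return
	}
	for _, p := range pages {
		fmt.Println(p)
	}
}
```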

MCP server

Run seaportal mcp to expose fetch_url, fetch_snapshot, parse_sitemap, and parse_feed as agent tools over stdio.

Safe by default

SSRF / private-IP blocking, http(s)-only, and redirect and response-body caps applied out of the box.

CLI · library · MCP

One Go binary, or import github.com/pinchtab/seaportal and call FromURL directly. Same engine everywhere.

Latency figures come from committed live sweeps — a 50-site in-niche sample and the Tranco top-1000. See the Reliability snapshot for the full p50 / p90 / p95 breakdown and method.

Advanced, when you need it

The core stays a fast, no-browser fetch-and-extract primitive. These build on top of it — reach for them only when a pipeline calls for it.

BM25 relevance scoring (--query / --top-n)
Token-capped output (--max-tokens)
Structural chunking
Split output
TEI-Lite export
seabench eval harness
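As background for the first item in that list, BM25 ranks each section by how often the query terms appear in it, discounted for common terms and long sections. The sketch below is a textbook Okapi BM25 over plain strings, not SeaPortal's scorer. The k1 and b values, the tokeniser, and the function names are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

const (
	k1 = 1.2  // term-frequency saturation (common default, assumed)
	b  = 0.75 // length normalisation strength (common default, assumed)
)

// tokenize lowercases and splits on anything that is not a letter or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// scoreSections returns one BM25 score per section for the query.
func scoreSections(sections []string, query string) []float64 {
	docs := make([][]string, len(sections))
	df := map[string]int{} // number of sections containing each term
	total := 0
	for i, s := range sections {
		docs[i] = tokenize(s)
		total += len(docs[i])
		seen := map[string]bool{}
		for _, t := range docs[i] {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	scores := make([]float64, len(sections))
	if total == 0 {
		return scores
	}
	n := float64(len(sections))
	avgLen := float64(total) / n
	terms := tokenize(query)
	for i, doc := range docs {
		tf := map[string]int{}
		for _, t := range doc {
			tf[t]++
		}
		for _, q := range terms {
			f := float64(tf[q])
			if f == 0 {
				continue
			}
			idf := math.Log(1 + (n-float64(df[q])+0.5)/(float64(df[q])+0.5))
			scores[i] += idf * f * (k1 + 1) / (f + k1*(1-b+b*float64(len(doc))/avgLen))
		}
	}
	return scores
}

// topN returns the indices of the n best-scoring sections, best first.
func topN(scores []float64, n int) []int {
	idx := make([]int, len(scores))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(x, y int) bool { return scores[idx[x]] > scores[idx[y]] })
	if n < 0 {
		n = 0
	}
	if n > len(idx) {
		n = len(idx)
	}
	return idx[:n]
}

func main() {
	sections := []string{
		"Install the binary with npm or go install.",
		"Flatten a sitemap and parse RSS or Atom feeds.",
		"The security policy blocks private IPs by default.",
	}
	scores := scoreSections(sections, "sitemap feeds")
	for _, i := range topN(scores, 2) {
		fmt.Printf("%.3f  %s\n", scores[i], sections[i])
	}
}
```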

> Use Cases

Where it fits

A fast, no-browser fetch-and-extract primitive for agent and ingestion pipelines.

Agent URL triage

Hand an agent a URL and get clean content back fast — plus a clear signal when the page needs a real browser, so the agent can route instead of stalling.

RAG & docs ingestion

Turn documentation, articles, and knowledge bases into clean Markdown with dedupe and BM25 section scoring — ready to chunk and embed.

Sitemap & feed discovery

Flatten a sitemap.xml (recursing indexes) or parse an RSS / Atom / JSON feed to enumerate what to fetch next — in one command.

Safe untrusted-URL fetching

Point it at URLs you do not control with SSRF / private-IP blocking, http(s)-only, and redirect and body caps on by default.

Browser fallback routing

Use SeaPortal as the cheap first hop and branch on profile.decision / browserRecommended to escalate only the hard pages to a browser like PinchTab.

> Integration

Wire it into an agent loop

Use SeaPortal as the cheap first hop, then escalate only the hard pages. Every fetch carries a routing decision so the agent never has to guess.

agent loop

# Fetch once as JSON, then route on the profile
result=$(seaportal --json "$url")
decision=$(echo "$result" | jq -r '.profile.decision')

# profile.browserRecommended is the single boolean to branch on;
# profile.decision is the detailed category behind it.
case "$decision" in
  static-high-confidence|static-ok)
    # Clean extraction: hand result.content straight to the agent
    ;;
  browser-needed|blocked)
    # Escalate to a real browser such as PinchTab
    pinchtab open "$url"
    ;;
  *)
    # not-found | unreachable | unsupported: skip or report
    ;;
esac

See the browser discriminator reference for every profile.decision value and exactly when each is emitted.
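The same routing can be driven from a Go program by shelling out to the CLI. This is a rough sketch rather than the recommended integration: it assumes the seaportal binary is on PATH, reads only the profile object, and handles just the decision values named in the shell example. The names nextStep and fetchProfile are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
)

// profile mirrors the "profile" object in the --json output.
type profile struct {
	Decision           string `json:"decision"`
	BrowserRecommended bool   `json:"browserRecommended"`
}

// nextStep maps a profile.decision value to what the agent loop does next.
func nextStep(decision string) string {
	switch decision {
	case "static-high-confidence", "static-ok":
		return "use-content"
	case "browser-needed", "blocked":
		return "escalate"
	default: // not-found, unreachable, unsupported, or anything unrecognised
		return "skip"
	}
}

// fetchProfile runs `seaportal --json <url>` and decodes only the profile.
func fetchProfile(url string) (profile, error) {
	out, err := exec.Command("seaportal", "--json", url).Output()
	if err != nil {
		return profile{}, err
	}
	var wrapper struct {
		Profile profile `json:"profile"`
	}
	err = json.Unmarshal(out, &wrapper)
	return wrapper.Profile, err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run . <url>")
		return
	}
	p, err := fetchProfile(os.Args[1])
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(os.Args[1], "->", nextStep(p.Decision))
}
```

Treating unknown decisions as "skip" keeps the loop safe if new categories appear; the browser discriminator reference remains the authority on the full set.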

> Architecture

Simple by design

One binary, one extraction engine. There is no service to host — drive it from the command line, embed the Go library, or run it as an MCP server.

seaportal

# Markdown (default), JSON, or an accessibility tree
seaportal https://example.com
seaportal --json https://example.com
seaportal --snapshot https://example.com

import (
	"fmt"

	"github.com/pinchtab/seaportal"
)

result := seaportal.FromURL("https://example.com")
fmt.Println(result.Content) // extracted Markdown

# Run as an MCP server over stdio
seaportal mcp

> Commands

One binary, a handful of commands

Extraction is the default verb; sitemap, feed, and mcp are subcommands. The full flag list lives in the CLI reference.

Command	Description
`seaportal <url>`	Extract a page as Markdown (the default verb)
`seaportal --json <url>`	Structured JSON output instead of Markdown
`seaportal --snapshot <url>`	Accessibility tree with element refs and selectors
`seaportal --fast <url>`	Bail early if the page needs a browser
`seaportal --query "..." <url>`	Rank sections by BM25 relevance (pair with --top-n)
`seaportal sitemap <url>`	Flatten a sitemap.xml (recurses indexes, .gz aware)
`seaportal feed <url>`	Parse RSS / Atom / JSON Feed into unified entries
`seaportal mcp`	Run as an MCP server over stdio

See the CLI reference for every flag — output formats, dedupe, caching, retries, link/image/table extraction, and the security policy.

> Plain Language

The terms, decoded

A quick decoder for the jargon used across this page and the docs.

HTTP-first

SeaPortal fetches a page over plain HTTP and never launches a browser. That makes it fast and cheap on static and server-rendered pages — and means it does not run JavaScript.

Accessibility snapshot

A JSON tree of the page where every element has a role, a stable ref (e1, e2…), and a CSS selector — so an agent can point at a specific button or field instead of parsing raw text.
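To illustrate how an agent might use such a tree, the sketch below walks a snapshot and keeps only the interactive nodes. The JSON shape is an assumption: the field names role, ref, selector, and children, and the set of roles treated as interactive, are hypothetical stand-ins, so check the real --snapshot output before relying on them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node is an assumed shape for one element of the snapshot tree.
type Node struct {
	Role     string `json:"role"`
	Ref      string `json:"ref"`
	Selector string `json:"selector"`
	Children []Node `json:"children"`
}

// interactiveRoles is an illustrative subset of roles, not an official list.
var interactiveRoles = map[string]bool{
	"button": true, "link": true, "textbox": true, "checkbox": true, "combobox": true,
}

// collectInteractive appends every interactive node under n in document order.
func collectInteractive(n Node, out []Node) []Node {
	if interactiveRoles[n.Role] {
		out = append(out, n)
	}
	for _, child := range n.Children {
		out = collectInteractive(child, out)
	}
	return out
}

func main() {
	raw := []byte(`{"role":"main","ref":"e1","selector":"main","children":[
		{"role":"heading","ref":"e2","selector":"h1"},
		{"role":"link","ref":"e3","selector":"a.more"},
		{"role":"form","ref":"e4","selector":"form","children":[
			{"role":"textbox","ref":"e5","selector":"input[name=q]"},
			{"role":"button","ref":"e6","selector":"button[type=submit]"}]}]}`)

	var root Node
	if err := json.Unmarshal(raw, &root); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, n := range collectInteractive(root, nil) {
		fmt.Println(n.Ref, n.Role, n.Selector)
	}
}
```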

Browser-needed signal

On each fetch SeaPortal classifies the page and emits profile.decision and browserRecommended. When a page truly needs rendering, it tells you so you can route it to a browser.

MCP server

Model Context Protocol — run "seaportal mcp" to expose fetch_url, fetch_snapshot, parse_sitemap, and parse_feed as tools an MCP-aware agent can call over stdio.

> Security

Safe by default

SeaPortal runs locally — there is no server to lock down. The fetch path itself ships hardened, and you loosen it only for targets you trust.

The CLI and MCP server apply DefaultSecurityPolicy() on every run — SSRF and private-IP blocking, scheme restrictions, and redirect and body caps — so an agent pointing it at a hostile URL can't be turned against your network. Go library callers opt in for untrusted URLs by setting Options.Security.

→ --block-private-ips — On by default. Targets resolving to private or internal IPs are rejected — an SSRF guard for agents fed untrusted URLs.
→ http / https only — Other schemes are refused before a request is ever made.
→ --max-redirects 10 — Redirects are capped and each hop is re-validated against the policy.
→ --max-response-bytes — Raw body capped at 50 MB; decompressed body at 200 MB — defuses decompression bombs.
→ --allow-domains / --deny-domains — Restrict reachable hosts with suffix-match allow/deny lists. Deny always wins.
→ --allow-internal — Explicit escape hatch to reach localhost or a private host — off unless you ask for it.
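The checks in this list can be pictured as a pre-flight test on each URL. The sketch below shows the general technique only and is not DefaultSecurityPolicy(): the Policy type and Check method are hypothetical, and it inspects literal IP hosts only, whereas a real guard also has to validate the addresses a hostname resolves to and re-check every redirect hop.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"net/url"
	"strings"
)

// Policy is a hypothetical stand-in for an allow/deny configuration.
type Policy struct {
	AllowDomains []string // if non-empty, the host must match one of these
	DenyDomains  []string // a match here always wins
}

// matchesSuffix reports whether host equals a domain or is a subdomain of it.
func matchesSuffix(host string, domains []string) bool {
	for _, d := range domains {
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

// isInternal covers private, loopback, link-local, and unspecified addresses.
func isInternal(addr netip.Addr) bool {
	addr = addr.Unmap() // treat IPv4-mapped IPv6 as plain IPv4
	return addr.IsPrivate() || addr.IsLoopback() || addr.IsLinkLocalUnicast() || addr.IsUnspecified()
}

// Check returns nil when the URL passes the scheme, domain, and literal-IP checks.
func (p Policy) Check(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("scheme not allowed: " + u.Scheme)
	}
	host := strings.ToLower(u.Hostname())
	if matchesSuffix(host, p.DenyDomains) {
		return errors.New("host is on the deny list: " + host)
	}
	if len(p.AllowDomains) > 0 && !matchesSuffix(host, p.AllowDomains) {
		return errors.New("host is not on the allow list: " + host)
	}
	if host == "localhost" {
		return errors.New("internal host blocked: " + host)
	}
	if addr, err := netip.ParseAddr(host); err == nil && isInternal(addr) {
		return errors.New("internal address blocked: " + host)
	}
	return nil
}

func main() {
	p := Policy{DenyDomains: []string{"tracker.example"}}
	for _, target := range []string{
		"https://example.com/",
		"file:///etc/passwd",
		"http://169.254.169.254/latest/meta-data/",
		"https://cdn.tracker.example/pixel",
	} {
		fmt.Printf("%-45s %v\n", target, p.Check(target))
	}
}
```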

See SECURITY.md for the full threat model and coverage caveats (e.g. --proxy and --snapshot have narrower checks).