SeaPortal

Fast content extraction for AI agents. HTTP-first, no browser required.

Install

# npm (recommended)
npm install -g seaportal

# Go
go install github.com/pinchtab/seaportal/cmd/seaportal@latest

Usage

seaportal https://pinchtab.com

# Options
seaportal --json https://pinchtab.com       # JSON output
seaportal --snapshot https://pinchtab.com   # Accessibility tree
seaportal --fast https://pinchtab.com       # Bail early if browser needed
seaportal --no-dedupe https://pinchtab.com  # Disable deduplication

# Subcommands
seaportal sitemap https://pinchtab.com/sitemap.xml  # Flatten a sitemap
seaportal feed https://pinchtab.com/feed.xml        # Parse RSS / Atom / JSON Feed
seaportal scrape https://pinchtab.com               # Scrape a whole site → structured JSON
seaportal mcp                                       # Run as an MCP server over stdio

# Version
seaportal --version

The full flag list and subcommands are in the CLI reference. SeaPortal also runs as an MCP server (seaportal mcp), and ships seabench, a benchmark/evaluation harness.

Site scraping

seaportal scrape <base-url> (also available as the ScrapeSite library call and the scrape_site MCP tool) crawls a whole site. It discovers URLs via robots.txt and the sitemap, falling back to a bounded crawl when neither is available, clusters similar paths into pattern groups, samples within a page budget, and extracts each page into a structured ScrapeResult (site, pageGroups, pages, summary). The JSON is designed as a hand-off to PinchTab for deep browser enrichment (screenshots, console/network capture, visual regression, and accessibility checks), which keeps SeaPortal focused on fast HTTP-level discovery and extraction. See the scrape CLI docs.
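
To make the cluster-then-sample idea concrete, here is a minimal Go sketch. It is not SeaPortal's implementation: the heuristic (any path segment containing a digit is treated as variable) and the names patternKey, groupByPattern, and samplePaths are assumptions chosen for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// patternKey collapses variable-looking path segments into "*", so that
// /posts/101 and /posts/102 share one key.
// Assumed heuristic: a segment is variable if it contains a digit.
func patternKey(path string) string {
	segs := strings.Split(strings.Trim(path, "/"), "/")
	for i, s := range segs {
		if strings.IndexFunc(s, unicode.IsDigit) >= 0 {
			segs[i] = "*"
		}
	}
	return "/" + strings.Join(segs, "/")
}

// groupByPattern buckets paths by their pattern key, preserving input order
// within each bucket.
func groupByPattern(paths []string) map[string][]string {
	groups := make(map[string][]string)
	for _, p := range paths {
		k := patternKey(p)
		groups[k] = append(groups[k], p)
	}
	return groups
}

// samplePaths takes up to perGroup paths from each group (in sorted key
// order) and stops once the overall page budget is reached.
func samplePaths(groups map[string][]string, perGroup, budget int) []string {
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var picked []string
	for _, k := range keys {
		for i, p := range groups[k] {
			if i >= perGroup || len(picked) >= budget {
				break
			}
			picked = append(picked, p)
		}
	}
	return picked
}

func main() {
	paths := []string{"/", "/docs/intro", "/posts/101", "/posts/102", "/posts/103"}
	groups := groupByPattern(paths)
	fmt.Println(len(groups), samplePaths(groups, 1, 10))
}
```

Sampling per group rather than per URL is what keeps a site with thousands of near-identical pages from consuming the whole budget; the real tool may use a different grouping rule.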

Accessibility Snapshot

The --snapshot flag outputs a semantic accessibility tree — useful for AI agents that need to understand page structure and interact with elements:

seaportal --snapshot https://pinchtab.com

{
  "role": "document",
  "children": [
    {
      "role": "navigation",
      "name": "Main",
      "tag": "nav",
      "ref": "e1",
      "selector": "#main-nav",
      "depth": 0,
      "children": [
        {"role": "link", "name": "Home", "tag": "a", "ref": "e2", "selector": "a.nav-link", "depth": 1, "href": "/", "interactive": true}
      ]
    }
  ]
}

Each node includes:

role — Accessibility role (heading, link, button, textbox, etc.)
name — Accessible name (from aria-label, title, alt, or text)
tag — HTML tag name (div, a, button, etc.)
ref — Element reference (e1, e2…) for targeting
selector — CSS selector for the element
depth — Nesting depth in the tree
interactive — Whether the element can be clicked or typed into
level — Heading level (1-6) for headings
href — Link target for links
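
A consumer of the JSON snapshot might decode it along these lines. This is a sketch under assumptions: the Node struct below is a hypothetical mirror of the fields listed above, not the type the library exports, and interactiveRefs and parseSnapshot are illustrative helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node mirrors the documented snapshot fields. Hypothetical type: the
// library's own struct may be shaped differently.
type Node struct {
	Role        string `json:"role"`
	Name        string `json:"name,omitempty"`
	Tag         string `json:"tag,omitempty"`
	Ref         string `json:"ref,omitempty"`
	Selector    string `json:"selector,omitempty"`
	Depth       int    `json:"depth,omitempty"`
	Interactive bool   `json:"interactive,omitempty"`
	Level       int    `json:"level,omitempty"`
	Href        string `json:"href,omitempty"`
	Children    []Node `json:"children,omitempty"`
}

// sampleJSON is the example snapshot from this README, on one line.
const sampleJSON = `{"role":"document","children":[{"role":"navigation","name":"Main","tag":"nav","ref":"e1","selector":"#main-nav","depth":0,"children":[{"role":"link","name":"Home","tag":"a","ref":"e2","selector":"a.nav-link","depth":1,"href":"/","interactive":true}]}]}`

// parseSnapshot decodes snapshot JSON into a Node tree.
func parseSnapshot(data []byte) (Node, error) {
	var root Node
	err := json.Unmarshal(data, &root)
	return root, err
}

// interactiveRefs walks the tree depth-first and returns the ref of every
// node flagged interactive.
func interactiveRefs(n Node) []string {
	var refs []string
	if n.Interactive && n.Ref != "" {
		refs = append(refs, n.Ref)
	}
	for _, c := range n.Children {
		refs = append(refs, interactiveRefs(c)...)
	}
	return refs
}

func main() {
	root, err := parseSnapshot([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	fmt.Println(interactiveRefs(root)) // [e2]
}
```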

Snapshot Options

# Filter to interactive elements only
seaportal --snapshot --filter=interactive https://example.com

# Compact text output (instead of JSON)
seaportal --snapshot --format=compact https://example.com

# Limit output size (approximate token count)
seaportal --snapshot --max-tokens=2000 https://example.com

# Combine options
seaportal --snapshot --filter=interactive --format=compact https://example.com

Compact format outputs a readable text tree:

document
  e1 navigation "Main" <nav> [interactive]
    e2 link "Home" <a> [interactive] href=/
    e3 link "Docs" <a> [interactive] href=/docs
  e4 main <main>
    e5 heading "Welcome" <h1> level=1
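
The line layout above can be reproduced with a few lines of Go, which may help when writing a parser or test fixture for it. The field order (ref, role, quoted name, tag, [interactive], level, href) is inferred from the example output only; the Node type and renderCompact are hypothetical, and the library's own ToCompact() remains the authoritative source of the format.

```go
package main

import (
	"fmt"
	"strings"
)

// Node holds just the fields that appear in the compact output.
// Hypothetical type for illustration.
type Node struct {
	Role, Name, Tag, Ref, Href string
	Level                      int
	Interactive                bool
	Children                   []Node
}

// writeCompact emits one line per node, indented two spaces per tree level.
// Empty fields are skipped, which is why the root prints as just "document".
func writeCompact(b *strings.Builder, n Node, indent int) {
	var parts []string
	if n.Ref != "" {
		parts = append(parts, n.Ref)
	}
	parts = append(parts, n.Role)
	if n.Name != "" {
		parts = append(parts, fmt.Sprintf("%q", n.Name))
	}
	if n.Tag != "" {
		parts = append(parts, "<"+n.Tag+">")
	}
	if n.Interactive {
		parts = append(parts, "[interactive]")
	}
	if n.Level > 0 {
		parts = append(parts, fmt.Sprintf("level=%d", n.Level))
	}
	if n.Href != "" {
		parts = append(parts, "href="+n.Href)
	}
	b.WriteString(strings.Repeat("  ", indent) + strings.Join(parts, " ") + "\n")
	for _, c := range n.Children {
		writeCompact(b, c, indent+1)
	}
}

// renderCompact returns the compact text for a whole tree.
func renderCompact(root Node) string {
	var b strings.Builder
	writeCompact(&b, root, 0)
	return b.String()
}

func main() {
	tree := Node{Role: "document", Children: []Node{
		{Role: "main", Tag: "main", Ref: "e4", Children: []Node{
			{Role: "heading", Name: "Welcome", Tag: "h1", Ref: "e5", Level: 1},
		}},
	}}
	fmt.Print(renderCompact(tree))
}
```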

As a Library

The public package is the module root, github.com/pinchtab/seaportal:

import (
    "fmt"
    "log"

    "github.com/pinchtab/seaportal"
)

// Extract content
result := seaportal.FromURL("https://pinchtab.com")
fmt.Println(result.Content) // extracted Markdown

// With options (new name: := cannot redeclare result in the same scope)
fast := seaportal.FromURLWithOptions("https://pinchtab.com", seaportal.Options{
    Dedupe:   true,
    FastMode: true,
})

// Build accessibility snapshot; htmlString is HTML you already have in hand
snapshot, err := seaportal.BuildSnapshot(htmlString)
if err != nil {
    log.Fatal(err)
}

// Snapshot with options (filter, max tokens)
opts := seaportal.SnapshotOptions{
    FilterInteractive: true,
    MaxTokens:         2000,
}
filtered, err := seaportal.BuildSnapshotWithOptions(htmlString, opts)
if err != nil {
    log.Fatal(err)
}

// Compact text output
fmt.Println(filtered.ToCompact())

See the API reference for the full surface.

Features

Fast in its niche — Pure HTTP; on reachable static/SSR pages p50 ~1s, p95 ~2s (across the open web the tail is much longer)
Stealthy — Chrome TLS fingerprint, realistic headers
Smart — Readability extraction + Markdown conversion
Semantic — Accessibility tree for AI agents
Honest — Classifies pages, signals when a browser is needed
Clean — Deduplicates repeated content blocks

Detection

Automatically detects:

Bot protection (Cloudflare, AWS WAF, DataDome, PerimeterX)
Captcha pages
Access denied / login walls
SPA / JavaScript-only content

Page Classification

| Class | Description |
| --- | --- |
| `static` | Pure HTML, high confidence |
| `ssr` | Server-rendered, good extraction |
| `hydrated` | SSR + JS enhancement, usually extractable |
| `spa` | JavaScript-only content, needs browser |
| `dynamic` | Heavy client-side rendering |
| `blocked` | Bot protection, captcha, access denied |

The quality float is an advisory soft signal, not a gate — clean server-rendered pages routinely score ~0 while extracting perfectly. Route on the page class and browser-recommendation signal (profile.decision / browserRecommended), not the raw quality value. See api.md and browser-discriminator.md.
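
As a sketch of what routing on class and recommendation (rather than on quality) could look like in caller code: the Action type and route function below are hypothetical, the class strings come from the table above, and treating blocked as its own escalation path is an assumption, not documented behavior.

```go
package main

import "fmt"

// Action is what the caller does next. Hypothetical type for illustration.
type Action string

const (
	UseExtraction Action = "use-extraction" // trust the HTTP-level result
	UseBrowser    Action = "use-browser"    // re-fetch with a real browser
	Escalate      Action = "escalate"       // blocked: needs a human or another strategy
)

// route decides from the page class and the browser-recommendation flag.
// The quality float is deliberately not a parameter.
func route(class string, browserRecommended bool) Action {
	if class == "blocked" {
		return Escalate
	}
	if browserRecommended {
		return UseBrowser
	}
	switch class {
	case "static", "ssr", "hydrated":
		return UseExtraction
	default:
		// "spa", "dynamic", and any unknown class: be conservative.
		return UseBrowser
	}
}

func main() {
	for _, c := range []string{"static", "hydrated", "spa", "blocked"} {
		fmt.Println(c, "->", route(c, false))
	}
}
```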

Reliability / what to expect

SeaPortal is a fast first-pass triage that fails over, not a universal fetcher. It wins on static and server-rendered pages and tells you when to reach for a browser instead of pretending every URL extracts.

Numbers below are a frozen snapshot of the committed live sweeps — full breakdown, dates, and git SHAs in the reliability reference:

| Metric | Reachable, in-niche (static/SSR) | Across the open web (Tranco top-1000) |
| --- | --- | --- |
| Latency (ok fetches) | p50 ~1s, p95 ~2s | p50 ~1.6s, p90 >10s, p95 ~15s |
| Success | ~94% ok | 40% ok (~53% after excluding the ~242 dead CDN/DNS infrastructure hosts) |

What that means in practice:

In its niche it’s fast and reliable.
Across the raw open web, ~1 in 3 hosts time out and ~1 in 4 error — many are CDN/DNS infrastructure domains (akamaiedge.net, cloudfront.net, …) that never serve HTML.
Treat extraction as triage: set --timeout and route on the browser-recommendation signal (profile.decision / browserRecommended), failing over to a real browser rather than assuming the happy path.
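
The triage advice above, expressed as a small Go sketch. Everything here is an assumption for illustration: Page, Fetcher, and triage are hypothetical names, and the two fetchers are stubs standing in for an HTTP-level extractor and a real browser.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Page is a hypothetical result shape carrying only what triage needs.
type Page struct {
	Content            string
	BrowserRecommended bool
}

// Fetcher abstracts "get me this page"; both paths share the signature.
type Fetcher func(ctx context.Context, url string) (Page, error)

// triage tries the fast path under a deadline and fails over to the browser
// path on error, timeout, or an explicit browser recommendation.
func triage(ctx context.Context, url string, fast, browser Fetcher, timeout time.Duration) (Page, string, error) {
	fastCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	page, err := fast(fastCtx, url)
	if err == nil && !page.BrowserRecommended {
		return page, "http", nil
	}
	page, err = browser(ctx, url)
	return page, "browser", err
}

// runDemo wires up a fast path that is too slow, so the browser stub wins.
func runDemo() (string, string, error) {
	slowHTTP := func(ctx context.Context, _ string) (Page, error) {
		select {
		case <-time.After(time.Second):
			return Page{Content: "late"}, nil
		case <-ctx.Done():
			return Page{}, ctx.Err()
		}
	}
	browser := func(_ context.Context, _ string) (Page, error) {
		return Page{Content: "rendered"}, nil
	}
	page, via, err := triage(context.Background(), "https://example.com", slowHTTP, browser, 50*time.Millisecond)
	return page.Content, via, err
}

func main() {
	fmt.Println(runDemo())
}
```

The deadline applies only to the fast path, so a slow first attempt costs at most the timeout before the browser path starts with the caller's original context.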

Regenerate any time with ./dev bench sweep (see the seabench reference).

What It Doesn’t Do

JavaScript execution
Full browser rendering
Cookie/session management

For JS-heavy pages, use a browser and pass HTML to seaportal.FromHTML().

Core vs. advanced surfaces

SeaPortal is, first, one thing: a fast, no-browser fetch-and-extract primitive that returns clean Markdown + an accessibility snapshot and tells you when a page needs a browser. That is the core, and everything in the value prop above describes it.

Layered on top are secondary, opt-in helpers — useful, but not the identity and off by default:

| Surface | What it is | Where |
| --- | --- | --- |
| Chunking (`--chunk`) | Split Markdown into heading/sentence/window chunks for RAG | api.md |
| BM25 ranking (`--query`) | Score heading-bounded sections by relevance | api.md |
| Split output (`--split-*`) | Shard a large extraction across files | api.md |
| TEI-Lite XML (`--xml`) | Wrap a result as TEI-Lite for corpus tooling | api.md |
| Sitemaps & feeds | Flatten `sitemap.xml`, parse RSS/Atom/JSON Feed | api.md |
| `seabench` | Benchmark / capability harness (dev tooling, not shipped product) | seabench.md |

If you only want the core, ignore all of the above: seaportal <url> and seaportal.FromURL(...) never touch them.

License

MIT