SeaPortal

Fast content extraction for AI agents. HTTP-first, no browser required.

Install

terminal
# npm (recommended)
npm install -g seaportal

# Go
go install github.com/pinchtab/seaportal/cmd/seaportal@latest

Usage

terminal
seaportal https://pinchtab.com

# Options
seaportal --json https://pinchtab.com       # JSON output
seaportal --snapshot https://pinchtab.com   # Accessibility tree
seaportal --fast https://pinchtab.com       # Bail early if browser needed
seaportal --no-dedupe https://pinchtab.com  # Disable deduplication

# Subcommands
seaportal sitemap https://pinchtab.com/sitemap.xml  # Flatten a sitemap
seaportal feed https://pinchtab.com/feed.xml        # Parse RSS / Atom / JSON Feed
seaportal mcp                                       # Run as an MCP server over stdio

# Version
seaportal --version

The full flag list and subcommands are in the CLI reference. SeaPortal also runs as an MCP server (seaportal mcp), and ships seabench, a benchmark/evaluation harness.

Accessibility Snapshot

The --snapshot flag outputs a semantic accessibility tree, useful for AI agents that need to understand page structure and interact with elements:

terminal
seaportal --snapshot https://pinchtab.com
{
  "role": "document",
  "children": [
    {
      "role": "navigation",
      "name": "Main",
      "tag": "nav",
      "ref": "e1",
      "selector": "#main-nav",
      "depth": 0,
      "children": [
        {"role": "link", "name": "Home", "tag": "a", "ref": "e2", "selector": "a.nav-link", "depth": 1, "href": "/", "interactive": true}
      ]
    }
  ]
}

Each node includes:

  • role: Accessibility role (heading, link, button, textbox, etc.)
  • name: Accessible name (from aria-label, title, alt, or text)
  • tag: HTML tag name (div, a, button, etc.)
  • ref: Element reference (e1, e2…) for targeting
  • selector: CSS selector for the element
  • depth: Nesting depth in the tree
  • interactive: Whether the element can be clicked or typed into
  • level: Heading level (1-6), present on headings
  • href: Link target, present on links
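
As a rough illustration of consuming these fields, the sketch below decodes the sample snapshot JSON shown earlier and collects the interactive nodes. The Node struct, parseSnapshot, and interactiveNodes are hypothetical names that mirror the JSON keys listed above; they are not the library's own types, and the sketch uses only the Go standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node is a hypothetical struct mirroring the snapshot JSON keys listed
// above. It is an assumption for illustration, not the library's own type.
type Node struct {
	Role        string `json:"role"`
	Name        string `json:"name"`
	Tag         string `json:"tag"`
	Ref         string `json:"ref"`
	Selector    string `json:"selector"`
	Depth       int    `json:"depth"`
	Interactive bool   `json:"interactive"`
	Level       int    `json:"level"`
	Href        string `json:"href"`
	Children    []Node `json:"children"`
}

// parseSnapshot decodes snapshot JSON into a Node tree.
func parseSnapshot(data []byte) (Node, error) {
	var root Node
	err := json.Unmarshal(data, &root)
	return root, err
}

// interactiveNodes walks the tree depth-first and returns every node
// flagged interactive, in document order.
func interactiveNodes(n Node) []Node {
	var out []Node
	if n.Interactive {
		out = append(out, n)
	}
	for _, c := range n.Children {
		out = append(out, interactiveNodes(c)...)
	}
	return out
}

// sample is the snapshot output from the example above, on one line.
const sample = `{"role":"document","children":[{"role":"navigation","name":"Main","tag":"nav","ref":"e1","selector":"#main-nav","depth":0,"children":[{"role":"link","name":"Home","tag":"a","ref":"e2","selector":"a.nav-link","depth":1,"href":"/","interactive":true}]}]}`

func main() {
	root, err := parseSnapshot([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, n := range interactiveNodes(root) {
		fmt.Printf("%s %s %q -> %s\n", n.Ref, n.Role, n.Name, n.Href)
	}
}
```

With the sample above this prints a single line for the Home link (ref e2). An agent could then use the ref or selector of each returned node to target the element.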

Snapshot Options

terminal
# Filter to interactive elements only
seaportal --snapshot --filter=interactive https://example.com

# Compact text output (instead of JSON)
seaportal --snapshot --format=compact https://example.com

# Limit output size (approximate token count)
seaportal --snapshot --max-tokens=2000 https://example.com

# Combine options
seaportal --snapshot --filter=interactive --format=compact https://example.com

Compact format outputs a readable text tree:

document
  e1 navigation "Main" <nav> [interactive]
    e2 link "Home" <a> [interactive] href=/
    e3 link "Docs" <a> [interactive] href=/docs
  e4 main <main>
    e5 heading "Welcome" <h1> level=1

As a Library

The public package is the module root, github.com/pinchtab/seaportal:

import (
    "fmt"
    "log"

    "github.com/pinchtab/seaportal"
)

// Extract content
result := seaportal.FromURL("https://pinchtab.com")
fmt.Println(result.Content) // extracted Markdown

// With options (result already exists, so assign rather than redeclare)
result = seaportal.FromURLWithOptions("https://pinchtab.com", seaportal.Options{
    Dedupe:   true,
    FastMode: true,
})

// Build accessibility snapshot (htmlString holds the page HTML)
snapshot, err := seaportal.BuildSnapshot(htmlString)
if err != nil {
    log.Fatal(err)
}

// Snapshot with options (filter, max tokens)
opts := seaportal.SnapshotOptions{
    FilterInteractive: true,
    MaxTokens:         2000,
}
snapshot, err = seaportal.BuildSnapshotWithOptions(htmlString, opts)
if err != nil {
    log.Fatal(err)
}

// Compact text output
fmt.Println(snapshot.ToCompact())

See the API reference for the full surface.

Features

  • Fast in its niche: Pure HTTP; on reachable static/SSR pages p50 ~1s, p95 ~2s (across the open web the tail is much longer)
  • Stealthy: Chrome TLS fingerprint, realistic headers
  • Smart: Readability extraction + Markdown conversion
  • Semantic: Accessibility tree for AI agents
  • Honest: Classifies pages, signals when a browser is needed
  • Clean: Deduplicates repeated content blocks

Detection

Automatically detects:

  • Bot protection (Cloudflare, AWS WAF, DataDome, PerimeterX)
  • Captcha pages
  • Access denied / login walls
  • SPA / JavaScript-only content

Page Classification

Class     Description
static    Pure HTML, high confidence
ssr       Server-rendered, good extraction
hydrated  SSR + JS enhancement, usually extractable
spa       JavaScript-only content, needs browser
dynamic   Heavy client-side rendering
blocked   Bot protection, captcha, access denied

The quality float is an advisory soft signal, not a gate: clean server-rendered pages routinely score ~0 while extracting perfectly. Route on the page class and browser-recommendation signal (profile.decision / browserRecommended), not the raw quality value. See api.md and browser-discriminator.md.
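
The routing advice above can be sketched roughly as follows. Profile, Route, and route are hypothetical names for illustration; the real result fields are described in api.md and may differ. The allow-list of extractable classes (static, ssr, hydrated) is an assumption read off the classification table, and sending blocked pages to a browser is likewise only one possible policy.

```go
package main

import "fmt"

// Profile is a hypothetical stand-in for the classification fields named
// above; it is not the library's own type.
type Profile struct {
	Class              string  // static, ssr, hydrated, spa, dynamic, blocked
	Quality            float64 // advisory only; deliberately ignored when routing
	BrowserRecommended bool
}

// Route says which path a URL should take next.
type Route string

const (
	UseExtraction Route = "extraction"
	UseBrowser    Route = "browser"
)

// route decides on the browser-recommendation flag and the page class.
// It never inspects Quality, since clean pages can score near zero.
func route(p Profile) Route {
	if p.BrowserRecommended {
		return UseBrowser
	}
	switch p.Class {
	case "static", "ssr", "hydrated":
		return UseExtraction
	}
	// spa, dynamic, blocked, and anything unrecognized: assume a browser.
	return UseBrowser
}

func main() {
	pages := []Profile{
		{Class: "ssr", Quality: 0.0},
		{Class: "spa", Quality: 0.9},
		{Class: "static", Quality: 0.8, BrowserRecommended: true},
	}
	for _, p := range pages {
		fmt.Printf("class=%s quality=%.1f -> %s\n", p.Class, p.Quality, route(p))
	}
}
```

Note that the first page is kept on the extraction path despite a quality of 0, and the second is sent to a browser despite a high quality value.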

Reliability / what to expect

SeaPortal is a fast first-pass triage that fails over, not a universal fetcher. It wins on static and server-rendered pages and tells you when to reach for a browser instead of pretending every URL extracts.

Numbers below are a frozen snapshot of the committed live sweeps; the full breakdown, dates, and git SHAs are in the reliability reference:

Metric                Reachable, in-niche (static/SSR)  Across the open web (Tranco top-1000)
Latency (ok fetches)  p50 ~1s, p95 ~2s                  p50 ~1.6s, p90 >10s, p95 ~15s
Success               ~94% ok                           40% ok (~53% after excluding the ~242 dead CDN/DNS infra hosts)
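
The relationship between the two open-web success figures can be checked with a little arithmetic. The sketch below assumes the sweep covers exactly 1000 hosts and that "40% ok" means exactly 400 of them; both are assumptions, and the authoritative counts live in the reliability reference.

```go
package main

import "fmt"

// successRate returns ok/total as a percentage.
func successRate(ok, total int) float64 {
	return 100 * float64(ok) / float64(total)
}

func main() {
	const (
		total = 1000 // assumption: one fetch per Tranco top-1000 host
		ok    = 400  // assumption: "40% ok" taken as exactly 400 hosts
		dead  = 242  // dead CDN/DNS infra hosts, from the table above
	)
	fmt.Printf("raw: %.1f%%\n", successRate(ok, total))
	fmt.Printf("excluding dead hosts: %.1f%%\n", successRate(ok, total-dead))
}
```

Under those assumptions the second figure is 400 / 758, about 52.8%, which matches the ~53% in the table.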

What that means in practice:

  • In its niche it’s fast and reliable.
  • Across the raw open web, ~1 in 3 hosts time out and ~1 in 4 error β€” many are CDN/DNS infrastructure domains (akamaiedge.net, cloudfront.net, …) that never serve HTML.
  • Treat extraction as triage: set --timeout and route on the browser-recommendation signal (profile.decision / browserRecommended), failing over to a real browser rather than assuming the happy path.
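
The triage pattern in the last bullet might look something like the sketch below. Fetcher, fetchWithFailover, and the stub functions are hypothetical names so that the example runs offline; in real use the fast path would wrap a SeaPortal call and the fallback would drive a real browser. Failing over on any fast-path error, not only on the browser signal, is a design choice of this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Fetcher is a hypothetical signature: it returns extracted content, a flag
// saying the page needs a browser, and an error.
type Fetcher func(ctx context.Context, url string) (content string, needsBrowser bool, err error)

// fetchWithFailover tries the fast path under a timeout and falls back to
// the browser path on error, timeout, or a needs-browser signal.
// The second return value names the path that produced the content.
func fetchWithFailover(ctx context.Context, url string, timeout time.Duration, fast, browser Fetcher) (string, string, error) {
	fastCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	content, needsBrowser, err := fast(fastCtx, url)
	if err == nil && !needsBrowser {
		return content, "fast", nil
	}

	// The fallback uses the parent context, not the expired fast-path one.
	content, _, err = browser(ctx, url)
	if err != nil {
		return "", "browser", fmt.Errorf("browser fallback for %s: %w", url, err)
	}
	return content, "browser", nil
}

// stubFast and stubBrowser are stand-ins so the sketch runs without network.
func stubFast(ctx context.Context, url string) (string, bool, error) {
	switch url {
	case "https://static.example":
		return "# static page", false, nil
	case "https://spa.example":
		return "", true, nil
	}
	return "", false, errors.New("fetch failed")
}

func stubBrowser(ctx context.Context, url string) (string, bool, error) {
	return "# rendered " + url, false, nil
}

// demo wires the stubs together with a 5-second fast-path timeout.
func demo(url string) (string, string, error) {
	return fetchWithFailover(context.Background(), url, 5*time.Second, stubFast, stubBrowser)
}

func main() {
	for _, u := range []string{"https://static.example", "https://spa.example", "https://down.example"} {
		content, via, err := demo(u)
		fmt.Printf("%s via=%s content=%q err=%v\n", u, via, content, err)
	}
}
```

Only the first URL is served by the fast path; the other two fall through to the browser stub, one because of the needs-browser signal and one because the fast fetch failed.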

Regenerate any time with ./dev bench sweep (see the seabench reference).

What It Doesn’t Do

  • JavaScript execution
  • Full browser rendering
  • Cookie/session management

For JS-heavy pages, use a browser and pass HTML to seaportal.FromHTML().

Core vs. advanced surfaces

SeaPortal is, first, one thing: a fast, no-browser fetch-and-extract primitive that returns clean Markdown + an accessibility snapshot and tells you when a page needs a browser. That is the core, and everything in the value prop above describes it.

Layered on top are secondary, opt-in helpers: useful, but not the identity, and off by default:

Surface                   What it is                                                         Where
Chunking (--chunk)        Split Markdown into heading/sentence/window chunks for RAG         api.md
BM25 ranking (--query)    Score heading-bounded sections by relevance                        api.md
Split output (--split-*)  Shard a large extraction across files                              api.md
TEI-Lite XML (--xml)      Wrap a result as TEI-Lite for corpus tooling                       api.md
Sitemaps & feeds          Flatten sitemap.xml, parse RSS/Atom/JSON Feed                      api.md
seabench                  Benchmark / capability harness (dev tooling, not shipped product)  seabench.md

If you only want the core, ignore all of the above: seaportal <url> and seaportal.FromURL(...) never touch them.

License

MIT