CLI Reference

```sh
seaportal [options] <url>          # Extract Markdown / JSON / snapshot (default verb)
seaportal sitemap <url> [flags]    # Flatten a sitemap.xml (recurses sitemap-index)
seaportal feed <url> [flags]       # Parse RSS / Atom / JSON Feed into unified entries
seaportal mcp                      # Run as an MCP server over stdio (see mcp.md)
seaportal help                     # Show usage
seaportal --version                # Show version (also -v)
```

The default verb extracts content from a URL. It writes the rendered Markdown and JSON to renders/seaportal/<host>_<timestamp>.{md,json} and prints the content plus a classification summary. Use --json, --xml, or --snapshot to select the output format.

Input modes

  • URL: seaportal https://example.com
  • stdin HTML: pipe HTML and pass --base-url to resolve relative links:
    ```sh
    cat page.html | seaportal --base-url https://example.com
    ```
    In stdin mode, --head-only, --respect-robots, and --retries are ignored.

Extract flags

| Flag | Default | Description |
| --- | --- | --- |
| --json | false | Output JSON instead of Markdown |
| --xml | false | Output TEI-Lite XML (mutually exclusive with --json) |
| --snapshot | false | Output accessibility tree (see Snapshot flags) |
| --fast | false | Bail early if a browser is needed |
| --probe-search | false | Force needs-browser for search pages with no result list |
| --no-dedupe | false | Disable block deduplication (on by default) |
| --no-near-dedupe | false | Disable simhash near-duplicate detection |
| --max-tokens N | 0 | Approximate token cap for output (0 = unlimited) |
| --retries N | 3 | Retry attempts for 502/503/504/429 |
| --max-retry-wait D | 30s | Max single backoff wait |
| --retry-timeout D | 90s | Total budget for all retries |
| --with-links | false | Emit discovered <a> links |
| --with-images | false | Emit discovered <img> entries |
| --with-tables | false | Emit structured tables |
| --with-comments | false | Emit user comments separately |
| --links MODE | all | Link retention: none / text / all / footer |
| --citations | false | Numbered references + ## References (synonym for --links=footer) |
| --chunk SPEC | "" | heading / sentence[:SIZE] / window[:SIZE[:OVERLAP]] |
| --select CSS | "" | CSS selector(s) to scope extraction (comma-separated) |
| --strip CSS | "" | CSS selector(s) to remove before extraction |
| --head-only | false | Fetch first 16 KB; metadata + canonical only |
| --no-prune-fallback | false | Disable tag-density fallback for thin output |
| --respect-robots | false | Consult robots.txt and refuse disallowed paths |
| --rate-limit D | 0 | Min interval between requests to the same host |
| --ua VALUE | "" | UA preset (chrome/safari/firefox/googlebot/bingbot/seaportal/search-bot) or literal string |
| --base-url URL | "" | Base URL for stdin HTML input |
| --proxy URL | "" | http(s):// or socks5:// proxy |
| --no-pdf | false | Skip PDF extraction |
| --schema PATH | "" | CSS schema (JSON/YAML) → result.schema |
| --version, -v | | Show version |
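
The --chunk grammar (heading / sentence[:SIZE] / window[:SIZE[:OVERLAP]]) can be read as a mode name followed by up to two optional integers. The Python sketch below parses that grammar only; ChunkSpec and parse_chunk_spec are hypothetical names, not part of seaportal. The unit of SIZE and the defaults applied when a field is omitted are not stated in this reference, so omitted fields are left as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkSpec:
    mode: str                      # "heading", "sentence", or "window"
    size: Optional[int] = None     # None = field omitted (tool default applies)
    overlap: Optional[int] = None  # only accepted for "window"


def parse_chunk_spec(spec: str) -> ChunkSpec:
    """Parse heading / sentence[:SIZE] / window[:SIZE[:OVERLAP]] (illustrative)."""
    mode, *args = spec.split(":")
    # Maximum number of numeric fields each mode accepts, per the grammar.
    max_args = {"heading": 0, "sentence": 1, "window": 2}
    if mode not in max_args:
        raise ValueError(f"unknown chunk mode: {mode!r}")
    if len(args) > max_args[mode]:
        raise ValueError(f"too many fields for {mode!r}: {spec!r}")
    numbers = [int(a) for a in args]  # int() raises ValueError on bad input
    if any(n < 0 for n in numbers):
        raise ValueError(f"negative value in chunk spec: {spec!r}")
    size = numbers[0] if len(numbers) >= 1 else None
    overlap = numbers[1] if len(numbers) == 2 else None
    return ChunkSpec(mode, size, overlap)
```

Under these assumptions, "window:512:64" parses to a window spec with size 512 and overlap 64, while "heading:5" is rejected because heading takes no fields.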

Cache

| Flag | Default | Description |
| --- | --- | --- |
| --cache DIR | "" | Enable on-disk cache at this directory |
| --cache-ttl D | 24h | Cache freshness window |
| --cache-stale-tolerance D | 0 | Stale-while-revalidate window past TTL |
| --no-cache | false | Bypass cache reads (writes still happen if --cache set) |
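
How --cache-ttl and --cache-stale-tolerance interact can be pictured as a three-way classification of a cached entry by its age. The Python sketch below is only an illustration of the usual stale-while-revalidate pattern; cache_state is a hypothetical helper, and the exact boundary handling (inclusive versus exclusive) in seaportal is an assumption.

```python
from datetime import timedelta


def cache_state(
    age: timedelta,
    ttl: timedelta = timedelta(hours=24),      # mirrors the --cache-ttl default
    stale_tolerance: timedelta = timedelta(0),  # mirrors --cache-stale-tolerance
) -> str:
    """Classify a cached entry by age (illustrative, not seaportal's code)."""
    if age <= ttl:
        return "fresh"    # inside the freshness window
    if age <= ttl + stale_tolerance:
        return "stale"    # past TTL but inside the tolerated stale window
    return "expired"      # too old to use; must be refetched
```

With the default tolerance of 0 there is no stale window, so an entry goes straight from fresh to expired once it is older than the TTL.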

Query / BM25

| Flag | Default | Description |
| --- | --- | --- |
| --query TEXT | "" | Score sections by BM25 relevance |
| --top-n N | 0 | Keep only the top-N most relevant sections |
| --filter-by-query | false | Replace content with concatenated top-N sections (default top-3) |
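
To make the idea of ranking sections by BM25 concrete, here is a small Python sketch of textbook Okapi BM25 followed by a top-N cut. It is not seaportal's implementation: the tokenizer, the k1 and b values, the idf variant, and the choice to return sections in rank order are all assumptions, and bm25_scores and top_n are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumed tokenizer: lowercase runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, sections: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each section against the query with textbook Okapi BM25."""
    docs = [tokenize(s) for s in sections]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of sections containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_n(query: str, sections: List[str], n: int = 3) -> List[str]:
    """Return the n highest-scoring sections, best first."""
    scores = bm25_scores(query, sections)
    ranked = sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)
    return [sections[i] for i in ranked[:n]]
```

Sections that share no terms with the query score 0.0, so they fall to the bottom of the ranking.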

Split output

| Flag | Default | Description |
| --- | --- | --- |
| --split-out DIR | "" | Write split output files into DIR (not supported with --xml) |
| --split-bytes N | 0 | Approx bytes per file (default --max-tokens × 4 or 32768) |
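
The --split-bytes default can be read as a fallback chain: an explicit value wins, otherwise four bytes per token of --max-tokens, otherwise 32768. The Python sketch below encodes that reading; effective_split_bytes is a hypothetical helper, and the precedence shown is an interpretation of the table text rather than confirmed behavior.

```python
def effective_split_bytes(split_bytes: int = 0, max_tokens: int = 0) -> int:
    """Resolve the approximate per-file byte budget (assumed precedence)."""
    if split_bytes > 0:
        return split_bytes     # explicit --split-bytes
    if max_tokens > 0:
        return max_tokens * 4  # roughly 4 bytes per token
    return 32768               # fallback when neither flag is set
```

For example, under this reading --max-tokens 2000 with no --split-bytes gives about 8000 bytes per file.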

Security

The CLI is safe by default: it applies DefaultSecurityPolicy() (SSRF / private-IP block on, http/https only, redirect + body caps). Loosen it only for trusted targets — e.g. --allow-internal to reach localhost / a private host. See SECURITY.md for the threat model and coverage caveats (--proxy skips the dial-time rebinding check).

| Flag | Default | Description |
| --- | --- | --- |
| --block-private-ips | true | Reject targets resolving to private/internal IPs (SSRF guard) |
| --allow-internal (alias --allow-private-ips) | false | Escape hatch: allow private/internal IP targets |
| --max-redirects N | 10 | Max redirect hops (0 = none, -1 = unlimited); each hop is re-validated |
| --allow-domains LIST | "" | Comma-separated host allowlist (suffix match); empty = allow any |
| --deny-domains LIST | "" | Comma-separated host blocklist (suffix match; deny wins) |
| --trusted-resolve-cidrs LIST | "" | CIDRs/IPs allowed to resolve to non-public addresses |
| --max-response-bytes N | 52428800 | Max raw response body bytes (0 = unlimited) |
| --max-decompressed-bytes N | 209715200 | Max decompressed body bytes; defuses decompression bombs (0 = unlimited) |
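
The allow/deny semantics (suffix match, empty allowlist allows any host, deny wins) can be sketched in a few lines of Python. host_matches and host_allowed are hypothetical names, and matching on a label boundary (so example.com covers docs.example.com but not notexample.com) is an assumption about what "suffix match" means here; SECURITY.md is the authority on the real behavior.

```python
from typing import Iterable


def host_matches(host: str, suffix: str) -> bool:
    """True if host equals suffix or is a subdomain of it (assumed rule)."""
    host = host.lower().rstrip(".")
    suffix = suffix.lower().strip().lstrip(".")
    if not suffix:
        return False
    return host == suffix or host.endswith("." + suffix)


def host_allowed(host: str,
                 allow: Iterable[str] = (),
                 deny: Iterable[str] = ()) -> bool:
    """Apply the blocklist first, then the allowlist."""
    if any(host_matches(host, d) for d in deny):
        return False  # deny wins over allow
    allow = list(allow)
    if not allow:
        return True   # empty allowlist = allow any host
    return any(host_matches(host, a) for a in allow)
```

Checking the blocklist first is what makes "deny wins" hold even when a host also matches an allowlist entry.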

Snapshot flags

| Flag | Default | Description |
| --- | --- | --- |
| --snapshot | false | Build the accessibility tree |
| --filter interactive | "" | Keep only interactive elements |
| --format json\|compact | json | JSON tree or readable text tree |
| --max-tokens N | 0 | Approximate token cap for the tree |

seaportal sitemap

```sh
seaportal sitemap https://example.com/sitemap.xml [--json] [--max-urls N] [--max-depth N]
```

| Flag | Default | Description |
| --- | --- | --- |
| --json | false | JSON array instead of newline-separated URLs |
| --max-urls N | 50000 | Stop after this many URLs |
| --max-depth N | 5 | Max sitemap-index recursion depth |

Recurses nested <sitemapindex> references and auto-decompresses .gz sitemaps.
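
The flattening behavior described here (recurse through sitemap indexes, decompress gzip, stop at a URL or depth cap) can be sketched in Python as follows. This is an illustration, not seaportal's code: flatten_sitemap is a hypothetical name, the fetch callback stands in for the HTTP layer, gzip is detected by its magic bytes rather than the .gz extension, and counting the root sitemap as depth 0 is an assumption.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, List


def _local(tag: str) -> str:
    # Strip the "{namespace}" prefix ElementTree puts on tag names.
    return tag.rsplit("}", 1)[-1]


def flatten_sitemap(url: str, fetch: Callable[[str], bytes],
                    max_urls: int = 50000, max_depth: int = 5) -> List[str]:
    """Collect page URLs from a sitemap, following nested sitemap indexes."""
    urls: List[str] = []

    def visit(sitemap_url: str, depth: int) -> None:
        if depth > max_depth or len(urls) >= max_urls:
            return
        data = fetch(sitemap_url)
        if data[:2] == b"\x1f\x8b":  # gzip magic bytes
            data = gzip.decompress(data)
        root = ET.fromstring(data)
        is_index = _local(root.tag) == "sitemapindex"
        for entry in root:  # <sitemap> or <url> elements
            loc = next((c.text.strip() for c in entry
                        if _local(c.tag) == "loc" and c.text), None)
            if loc is None:
                continue
            if is_index:
                visit(loc, depth + 1)  # nested sitemap: recurse
            else:
                urls.append(loc)       # leaf entry: a page URL
            if len(urls) >= max_urls:
                return

    visit(url, 0)
    return urls
```

Passing fetch as a parameter keeps the sketch free of network code; in practice it would wrap whatever HTTP client and security policy the caller uses.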

seaportal feed

```sh
seaportal feed https://example.com/feed.xml [--json] [--max-items N]
```

| Flag | Default | Description |
| --- | --- | --- |
| --json | false | JSON array instead of TSV (published\ttitle\tlink) |
| --max-items N | 200 | Stop after this many items |

Parses RSS 2.0, Atom 1.0, and JSON Feed 1.x into unified entries.
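
The default TSV output (published, title, link separated by tabs) is easy to consume from a script. The Python sketch below turns captured output into dictionaries; parse_feed_tsv is a hypothetical helper, the sample rows in the usage note are made up for illustration, and the sketch assumes no field contains a tab. The published value is kept as a string because its date format is not specified here.

```python
from typing import Dict, List


def parse_feed_tsv(output: str) -> List[Dict[str, str]]:
    """Parse 'published<TAB>title<TAB>link' lines into dictionaries."""
    entries: List[Dict[str, str]] = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three fields
        published, title, link = parts
        entries.append({"published": published, "title": title, "link": link})
    return entries
```

For instance, feeding it the text "2024-01-02\tHello\thttps://example.com/hello" yields one entry whose title is "Hello". For anything beyond quick scripting, --json avoids the tab-escaping question entirely.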