API Reference

The public Go API is the module-root package seaportal (github.com/pinchtab/seaportal). All implementation lives under internal/; this package re-exports the stable surface.

import "github.com/pinchtab/seaportal"

Extraction

// Default options.
result := seaportal.FromURL("https://example.com")
fmt.Println(result.Content) // extracted Markdown body

// Custom options.
result := seaportal.FromURLWithOptions("https://example.com", seaportal.Options{
    FastMode:  true,
    WithLinks: true,
})

// From raw HTML you already have (e.g. fetched by a browser).
result := seaportal.FromHTML(htmlString, "https://example.com")
result := seaportal.FromHTMLWithOptions(htmlString, "https://example.com", opts)

| Function | Signature |
| --- | --- |
| FromURL | func(targetURL string) Result |
| FromURLWithOptions | func(targetURL string, opts Options) Result |
| FromURLWithDedupe | func(targetURL string) Result |
| FromHTML | func(html, targetURL string) Result |
| FromHTMLWithOptions | func(html, targetURL string, opts Options) Result |
| FromResponse | func(resp *http.Response, targetURL string, start time.Time) Result |
| ExtractFromHTML | func(html, targetURL string) (string, error) (Markdown only) |
| ResultToTEIXML | func(r Result) ([]byte, error) (TEI-Lite XML) |

Extraction functions return Result by value and never error; transport and parse failures are reported on Result.Error, Result.StatusCode, and Result.Profile.
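Because there is no error return, callers have to inspect the Result itself. The sketch below shows one way to do that. It uses a local stand-in Result with only three of the documented fields so that it compiles without the module, and summarize is a hypothetical helper, not part of seaportal.

```go
package main

import "fmt"

// Result is a local stand-in mirroring three documented fields of
// seaportal.Result; it is NOT the real type.
type Result struct {
	Content    string
	Error      string
	StatusCode int
}

// summarize is a hypothetical helper: it folds the failure fields into one
// status line, since there is no error value to check.
func summarize(r Result) string {
	switch {
	case r.Error != "":
		return "failed: " + r.Error
	case r.StatusCode >= 400:
		return fmt.Sprintf("http error: %d", r.StatusCode)
	default:
		return fmt.Sprintf("ok: %d bytes of Markdown", len(r.Content))
	}
}

func main() {
	fmt.Println(summarize(Result{Content: "# Title", StatusCode: 200}))
	fmt.Println(summarize(Result{Error: "dial tcp: timeout"}))
	fmt.Println(summarize(Result{StatusCode: 503}))
}
```

Checking Error before StatusCode is a choice made for this sketch; a real caller would also consult Result.Profile, which is omitted here.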

Options

Selected fields (see seaportal.go / internal/engine for the full struct):

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| Dedupe | bool | false | Remove duplicate content blocks |
| NoNearDedupe | bool | false | Disable simhash near-duplicate detection |
| FastMode | bool | false | Bail early if a browser is likely needed |
| ProbeSearch | bool | false | Force needs-browser for search pages with no result list |
| MaxRetries | int | | Retry attempts for 502/503/504/429 |
| MaxRetryWait | time.Duration | | Max single backoff wait |
| TotalRetryTimeout | time.Duration | | Total budget across retries |
| WithLinks | bool | false | Emit discovered <a> links on Result |
| WithImages | bool | false | Emit discovered <img> entries |
| WithTables | bool | false | Emit structured tables |
| WithComments | bool | false | Emit user comments on Result.Comments |
| Citations | bool | false | Numbered references + ## References (alias for LinkRetention = footer) |
| LinkRetention | LinkRetention | LinkRetentionAll | none / text / all / footer |
| Chunk | ChunkConfig | off | Markdown chunking strategy |
| SelectCSS / StripCSS | string | "" | CSS selectors to scope / remove before extraction |
| MaxTokens | int | 0 | Approximate output token cap (0 = unlimited) |
| HeadOnly | bool | false | Fetch first 16 KB, metadata + canonical only |
| NoPruneFallback | bool | false | Disable tag-density fallback for thin output |
| RespectRobots | bool | false | Consult robots.txt before fetching |
| RateLimit | time.Duration | 0 | Min interval between requests to the same host |
| UserAgent | string | "" | Preset name or literal UA string |
| Proxy | string | "" | http(s):// or socks5:// proxy URL |
| CacheDir | string | "" | Enable on-disk cache at this path |
| CacheTTL | time.Duration | | Cache freshness window |
| CacheStaleTolerance | time.Duration | 0 | Stale-while-revalidate window |
| NoCache | bool | false | Bypass cache reads |
| NoPDF | bool | false | Skip PDF extraction |
| SchemaPath | string | "" | CSS schema file for structured extraction |
| Query | string | "" | BM25 query to score sections |
| TopN | int | 0 | Keep only top-N sections |
| FilterByQuery | bool | false | Replace Content with top-N sections |
| SplitOut / SplitBytes | string / int | "" / 0 | Split output across files |
| Security | *SecurityPolicy | nil | SSRF / private-IP / redirect / decompression guard (see below) |
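The NoNearDedupe row refers to simhash near-duplicate detection. For readers unfamiliar with the technique, here is a minimal textbook simhash over whitespace tokens. The tokenization, the FNV-1a hash, and the function names are assumptions made for this sketch; the library's actual implementation and thresholds are not described in this document.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash computes a 64-bit fingerprint: every token votes +1 or -1 on each
// bit position, and the sign of the tally decides the output bit.
func simhash(text string) uint64 {
	var tally [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				tally[i]++
			} else {
				tally[i]--
			}
		}
	}
	var out uint64
	for i, v := range tally {
		if v > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

// hamming counts the bits that differ between two fingerprints; a small
// distance suggests near-duplicate text.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	a := simhash("Subscribe to our newsletter for weekly updates and offers")
	b := simhash("Subscribe to our newsletter for weekly updates and deals")
	c := simhash("The committee approved the annual infrastructure budget")
	fmt.Println("near-duplicate distance:", hamming(a, b))
	fmt.Println("unrelated distance:     ", hamming(a, c))
}
```

Two blocks would then be treated as near-duplicates when their Hamming distance falls under a chosen threshold; the threshold is a tuning decision this sketch leaves open.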

Security policy

Options.Security threads an SSRF / private-IP / redirect / decompression guard through the whole fetch path. A nil policy (the zero value) disables every check — the historical unguarded behaviour — so a library caller handling untrusted URLs must set one. The CLI and the MCP server apply DefaultSecurityPolicy() automatically.

result := seaportal.FromURLWithOptions(untrustedURL, seaportal.Options{
    Security: seaportal.DefaultSecurityPolicy(), // blocks private IPs, allows only http/https, applies size caps
})
if result.SecurityBlock != "" {
    // refused by policy (SSRF / scheme / domain / size cap)
}

type SecurityPolicy struct {
    BlockPrivateIPs      bool     // reject RFC1918/loopback/link-local/ULA/CGNAT/metadata IPs
    TrustedResolveCIDRs  []string // CIDRs/IPs allowed to resolve non-public (escape hatch)
    AllowedDomains       []string // host allowlist (suffix match); empty = any
    DeniedDomains        []string // host blocklist (suffix match; deny wins)
    AllowedSchemes       []string // default {"http","https"}; rejects file:/ftp:/gopher:/ws:
    MaxRedirects         int      // >0 = cap, 0 = follow no redirects, -1 = unlimited
    RevalidateRedirects  bool     // re-validate scheme+host+IP on every redirect hop
    MaxResponseBytes     int64    // cap raw body (0 = unbounded)
    MaxDecompressedBytes int64    // cap decompressor output — defuses zip bombs (0 = unbounded)
}

DefaultSecurityPolicy() returns BlockPrivateIPs: true, http/https only, MaxRedirects: 10 with RevalidateRedirects: true, and 50 MiB / 200 MiB body caps. Enforcement runs pre-fetch, at the dial Control hook (closing the DNS-rebinding window), on every redirect hop, and at the read/decompress boundary. A refusal sets both Result.Error and Result.SecurityBlock. See SECURITY.md.

Result

Result (internal/engine/result.go) marshals to camelCase JSON. Core fields:

type Result struct {
    URL          string      `json:"url"`
    CanonicalURL string      `json:"canonicalUrl,omitempty"`
    Title        string      `json:"title"`
    Content      string      `json:"content"`      // extracted Markdown
    Byline       string      `json:"byline"`
    Excerpt      string      `json:"excerpt"`
    SiteName     string      `json:"sitename"`
    Language     string      `json:"language,omitempty"`
    Length       int         `json:"length"`
    TimeMs       int64       `json:"timeMs"`
    Confidence   int         `json:"confidence"`
    IsSPA        bool        `json:"isSpa"`
    IsBlocked    bool        `json:"isBlocked"`
    SPASignals   []string    `json:"spaSignals,omitempty"`
    Quality      float64     `json:"quality"`      // advisory soft signal — do NOT route on it (see note below)
    Profile      PageProfile `json:"profile"`      // classification + browser-routing decision
    PageClass    PageClass   `json:"pageClass"`
    Validation   Validation  `json:"validation"`
    Fingerprint  string      `json:"fingerprint"`
    Error         string     `json:"error,omitempty"`
    SecurityBlock string     `json:"securityBlock,omitempty"` // reason a SecurityPolicy refused the fetch
    StatusCode    int        `json:"statusCode,omitempty"`
    // ... plus cache, timing, redirect, and response-header forensics fields
}

quality is a soft signal, not a gate. The float is deliberately noisy: excellent server-rendered pages routinely score near zero (Wikipedia and GitHub score 0; theguardian.com extracts ~36k clean chars yet scores 0) while still being fully extractable. Route on profile.decision / profile.browserRecommended (and profile.class + profile.outcome) — never on the raw quality float. See browser-discriminator.md for why quality is not a routing input, and classifier-validation.md for the held-out evidence that the routing decision generalizes (31/31) while the raw 6-way class does not (0.71).
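The routing rule above can be sketched for a consumer that reads the JSON form of a Result. The structs here are local stand-ins that carry only the documented JSON keys they need, and needsBrowser is a hypothetical helper; the point is that the quality field is decoded but never consulted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pageProfile and result are local stand-ins holding a subset of the
// documented JSON fields; they are not the library's types.
type pageProfile struct {
	Class              string `json:"class"`
	Outcome            string `json:"outcome"`
	BrowserRecommended bool   `json:"browserRecommended"`
}

type result struct {
	URL     string      `json:"url"`
	Quality float64     `json:"quality"`
	Profile pageProfile `json:"profile"`
}

// needsBrowser routes on the profile only. Quality is deliberately unused:
// a score of 0 does not imply the page failed to extract.
func needsBrowser(raw []byte) (bool, error) {
	var r result
	if err := json.Unmarshal(raw, &r); err != nil {
		return false, err
	}
	return r.Profile.BrowserRecommended, nil
}

func main() {
	// Illustrative payload: quality is 0, yet no browser is recommended.
	raw := []byte(`{"url":"https://example.com","quality":0,
		"profile":{"class":"ssr","outcome":"extract","browserRecommended":false}}`)
	browser, err := needsBrowser(raw)
	fmt.Println(browser, err)
}
```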

Classification

profile := seaportal.ClassifyPage(result)            // PageProfile
signals, isSPA := seaportal.DetectSPA(htmlString)
blocked := seaportal.DetectBlocked(htmlString)
needsBrowser, reason := seaportal.QuickNeedsBrowser(htmlString)

PageProfile:

type PageProfile struct {
    Class              PageClass         `json:"class"`
    Outcome            ExtractionOutcome `json:"outcome"`
    Decision           BrowserDecision   `json:"decision"`
    BrowserRecommended bool              `json:"browserRecommended"`
    Reasons            []string          `json:"reasons"`
    Confidence         int               `json:"confidence"`
    Trustworthy        bool              `json:"trustworthy"`
}

  • PageClass: static, ssr, hydrated, spa, dynamic, blocked.
  • ExtractionOutcome: extract, extract-with-warning, fail-fast, needs-browser.
  • BrowserDecision + BrowserRecommended drive browser fall-through — see browser-discriminator.md.
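A caller will typically branch on the outcome value. The sketch below maps the four documented ExtractionOutcome strings to caller actions. The action names, and the interpretation of each outcome, are assumptions inferred from the value names; nextStep is hypothetical and not part of the package.

```go
package main

import "fmt"

// nextStep is a hypothetical mapping from the documented ExtractionOutcome
// strings to a caller action. The actions are illustrative only.
func nextStep(outcome string) string {
	switch outcome {
	case "extract":
		return "use-content"
	case "extract-with-warning":
		return "use-content-and-log-reasons"
	case "needs-browser":
		return "refetch-with-browser"
	case "fail-fast":
		return "give-up"
	default:
		// Unknown values are surfaced rather than silently accepted.
		return "unknown-outcome"
	}
}

func main() {
	for _, o := range []string{"extract", "extract-with-warning", "fail-fast", "needs-browser"} {
		fmt.Printf("%-22s -> %s\n", o, nextStep(o))
	}
}
```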

Snapshots

node, err := seaportal.BuildSnapshot(htmlString)
node, err := seaportal.BuildSnapshotWithOptions(htmlString, seaportal.SnapshotOptions{
    FilterInteractive: true,
    MaxTokens:         2000,
})
fmt.Println(node.ToCompact()) // readable text tree

SnapshotNode fields: role, name, tag, ref (e.g. e5), selector, depth, interactive, level, value, href, checked, disabled, children.
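As an illustration of how a tree with these fields might be consumed, the sketch below collects the ref of every interactive node. The node type is a local stand-in with a handful of the listed fields; the Go field names and types of the real SnapshotNode are assumed, and interactiveRefs is a hypothetical helper.

```go
package main

import "fmt"

// node is a local stand-in carrying a few of the documented SnapshotNode
// fields; the real type may name or type them differently.
type node struct {
	Role        string
	Name        string
	Ref         string
	Interactive bool
	Children    []*node
}

// interactiveRefs walks the tree depth-first and returns the refs of
// interactive nodes in document order.
func interactiveRefs(n *node) []string {
	if n == nil {
		return nil
	}
	var refs []string
	if n.Interactive && n.Ref != "" {
		refs = append(refs, n.Ref)
	}
	for _, c := range n.Children {
		refs = append(refs, interactiveRefs(c)...)
	}
	return refs
}

func main() {
	tree := &node{Role: "main", Ref: "e1", Children: []*node{
		{Role: "heading", Name: "Sign in", Ref: "e2"},
		{Role: "textbox", Name: "Email", Ref: "e3", Interactive: true},
		{Role: "button", Name: "Continue", Ref: "e4", Interactive: true},
	}}
	fmt.Println(interactiveRefs(tree)) // prints [e3 e4]
}
```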

Content processing

Secondary surfaces — opt-in helpers around the core extract primitive, not part of the default path. See Core vs. advanced surfaces.

seaportal.Dedupe(content)                            // DedupeResult
seaportal.DedupeWithOptions(content, opts)
seaportal.CleanupMarkdown(md)                         // string
seaportal.PreprocessHTML(html)                        // string
seaportal.RankSections(content, query, k1, b, topN)   // []RankedSection (BM25; k1/b 0 → 1.5/0.75)
seaportal.ChunkMarkdown(md, cfg)                      // []Chunk
seaportal.SplitResultToFiles(result, cfg)             // ([]SplitFile, error)
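
For background on the k1 and b parameters of RankSections, the following is a textbook Okapi BM25 scorer over whitespace tokens. It mirrors the documented convention that a zero k1 or b falls back to 1.5 or 0.75, but everything else (tokenization, the IDF variant, the names bm25Scores and rank) is an assumption of this sketch rather than the library's implementation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// bm25Scores returns one score per section for the query.
func bm25Scores(sections []string, query string, k1, b float64) []float64 {
	if k1 == 0 {
		k1 = 1.5
	}
	if b == 0 {
		b = 0.75
	}
	docs := make([][]string, len(sections))
	total := 0
	for i, s := range sections {
		docs[i] = strings.Fields(strings.ToLower(s))
		total += len(docs[i])
	}
	scores := make([]float64, len(sections))
	if total == 0 {
		return scores
	}
	avgLen := float64(total) / float64(len(docs))
	for _, term := range strings.Fields(strings.ToLower(query)) {
		tf := make([]int, len(docs)) // term frequency per section
		df := 0                      // number of sections containing the term
		for i, d := range docs {
			for _, w := range d {
				if w == term {
					tf[i]++
				}
			}
			if tf[i] > 0 {
				df++
			}
		}
		n := float64(len(docs))
		idf := math.Log(1 + (n-float64(df)+0.5)/(float64(df)+0.5))
		for i, d := range docs {
			if tf[i] == 0 {
				continue
			}
			f := float64(tf[i])
			norm := f + k1*(1-b+b*float64(len(d))/avgLen)
			scores[i] += idf * f * (k1 + 1) / norm
		}
	}
	return scores
}

// rank returns section indices ordered by descending score.
func rank(scores []float64) []int {
	order := make([]int, len(scores))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(x, y int) bool {
		return scores[order[x]] > scores[order[y]]
	})
	return order
}

func main() {
	sections := []string{
		"go modules and packages",
		"cooking pasta at home",
		"go go go concurrency in go",
	}
	scores := bm25Scores(sections, "go", 0, 0)
	fmt.Println(rank(scores)) // prints [2 0 1]
}
```

Here k1 controls how quickly repeated occurrences of a term saturate, and b controls how strongly longer sections are penalized.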

Sitemaps & feeds

Secondary surfaces — convenience parsers, not part of the core extract path. See Core vs. advanced surfaces.

entries, err := seaportal.FlattenSitemap(ctx, url, seaportal.FlattenSitemapOptions{
    MaxDepth: 5, MaxURLs: 50000, Timeout: 30 * time.Second,
})
items, err := seaportal.ParseFeed(ctx, url, seaportal.ParseFeedOptions{
    MaxItems: 200, Timeout: 30 * time.Second,
})

ParseFeed handles RSS 2.0, Atom 1.0, and JSON Feed 1.x into a unified []FeedItem.
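
The idea of normalizing different feed formats into one item type can be sketched with encoding/xml alone. This example covers RSS 2.0 and Atom 1.0 only (JSON Feed is left out), and feedItem, rssDoc, atomDoc, and parseFeed are hypothetical names with an assumed two-field item shape; the real FeedItem fields are not listed in this document.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// feedItem is a hypothetical unified shape with just a title and a link.
type feedItem struct {
	Title string
	Link  string
}

type rssDoc struct {
	Items []struct {
		Title string `xml:"title"`
		Link  string `xml:"link"`
	} `xml:"channel>item"`
}

type atomDoc struct {
	Entries []struct {
		Title string `xml:"title"`
		Links []struct {
			Href string `xml:"href,attr"`
		} `xml:"link"`
	} `xml:"entry"`
}

// parseFeed inspects the root element name and decodes accordingly.
func parseFeed(data []byte) ([]feedItem, error) {
	var probe struct {
		XMLName xml.Name
	}
	if err := xml.Unmarshal(data, &probe); err != nil {
		return nil, err
	}
	var items []feedItem
	switch probe.XMLName.Local {
	case "rss":
		var doc rssDoc
		if err := xml.Unmarshal(data, &doc); err != nil {
			return nil, err
		}
		for _, it := range doc.Items {
			items = append(items, feedItem{Title: it.Title, Link: it.Link})
		}
	case "feed":
		var doc atomDoc
		if err := xml.Unmarshal(data, &doc); err != nil {
			return nil, err
		}
		for _, e := range doc.Entries {
			item := feedItem{Title: e.Title}
			if len(e.Links) > 0 {
				item.Link = e.Links[0].Href // first <link> wins in this sketch
			}
			items = append(items, item)
		}
	default:
		return nil, fmt.Errorf("unsupported feed root <%s>", probe.XMLName.Local)
	}
	return items, nil
}

func main() {
	rss := `<rss version="2.0"><channel><title>T</title>
		<item><title>First</title><link>https://example.com/1</link></item>
	</channel></rss>`
	atom := `<feed xmlns="http://www.w3.org/2005/Atom">
		<entry><title>Second</title><link href="https://example.com/2"/></entry>
	</feed>`
	for _, src := range []string{rss, atom} {
		items, err := parseFeed([]byte(src))
		fmt.Println(items, err)
	}
}
```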

Fingerprinting

fp := seaportal.SemanticFingerprint(content)          // string
changed := seaportal.ContentChanged(old, new)         // bool
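
This document does not say how SemanticFingerprint is computed. As a rough mental model only, the sketch below normalizes case and whitespace and then hashes with SHA-256, so that cosmetic differences do not register as a change. The fingerprint and contentChanged functions are hypothetical stand-ins, and the real algorithm may be quite different (for example, tolerant of small edits).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// fingerprint lowercases the text, collapses whitespace runs, and returns
// the hex SHA-256 of the normalized form.
func fingerprint(content string) string {
	normalized := strings.Join(strings.Fields(strings.ToLower(content)), " ")
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

// contentChanged reports whether two versions differ after normalization.
func contentChanged(oldContent, newContent string) bool {
	return fingerprint(oldContent) != fingerprint(newContent)
}

func main() {
	fmt.Println(contentChanged("Hello   world\n", "hello world"))   // prints false
	fmt.Println(contentChanged("Price: $10", "Price: $12"))         // prints true
}
```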

CLI

The same surface is available from the command line — see cli.md, mcp.md (MCP server), and seabench.md (benchmark harness).