Overview

Parse is a synchronous API that extracts structured content from web pages. Give it a URL or raw HTML, and it returns structured data -- title, author, article text, images, and metadata -- in a single request. No polling, no webhooks. One call and you have your data.

What Parse Does

Send a POST /parse request with either a URL or raw HTML, and Parse returns structured content extracted by AI. The response includes the page's primary content, a consumability assessment, and, in URL mode, HTTP metadata from the scrape. Everything happens synchronously -- the response itself contains your extracted data.

How It Works

Parse operates in two modes:

URL mode: Provide a url. Parse loads the page in a headless browser, then runs AI extraction on the rendered content.
HTML mode: Provide raw HTML in the html field. Parse skips the browser scrape entirely and runs AI extraction directly on the provided markup.
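
As a rough illustration of the two modes, the sketch below builds a POST /parse request with Python's standard library. The base URL `https://api.example.com` is a placeholder and no authentication header is shown, since neither is specified on this page. It also assumes `url` and `html` are mutually exclusive, which the description above implies but does not state.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical base URL; substitute the real host and add auth headers as required.
BASE_URL = "https://api.example.com"


def build_parse_request(url: Optional[str] = None,
                        html: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /parse request for URL mode or HTML mode."""
    # Assumption: exactly one of the two fields selects the mode.
    if (url is None) == (html is None):
        raise ValueError("provide exactly one of url or html")
    payload = {"url": url} if url is not None else {"html": html}
    return urllib.request.Request(
        BASE_URL + "/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Calling `build_parse_request(url="https://example.com/article")` yields a URL-mode request, while `build_parse_request(html="<p>hi</p>")` yields an HTML-mode one.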

The Extraction Pipeline

Regardless of mode, content flows through three stages:

Content acquisition -- For URL mode, a headless browser loads and renders the page. For HTML mode, the provided markup is stored directly.
AI-powered structured extraction -- An LLM with structured output analyzes the content, extracting title, author, body text, images, video, and metadata.
Consumability evaluation -- The extracted content is assessed to determine whether the page contains meaningful, self-contained content.

Understanding the Response

The response data object contains up to four top-level fields; scrape is returned only in URL mode:

hasPrimaryContent (boolean) -- Quick check for whether the page had extractable content. Useful for filtering before inspecting the full response.

consumability (object) -- Contains isConsumable (boolean) and reason (string), which explains the assessment. See the Consumability section below.

primaryContent (object | null) -- The extracted content. Null when no content could be extracted. Contains:

Field	Type	Description
`title`	string | null	Page or article title
`description`	string | null	Summary or meta description
`author`	string | null	Content author
`publisher`	string | null	Publishing organization
`publishedAt`	string | null	Original publication date
`updatedAt`	string | null	Last update date
`isSponsored`	boolean | null	Whether the content is sponsored
`isDigest`	boolean | null	Whether the content is a digest or roundup
`accessRestrictionType`	string[] | null	Detected access restrictions
`text`	object | null	Body content with `simplifiedHtml`
`video`	object | null	Video URL and duration if present
`primaryImage`	object | null	Primary image with caption and credit
`originallyPublished`	object | null	Original source info for syndicated content

All fields are nullable because not every page has every field.

scrape (object) -- Contains the HTTP status code from the browser scrape. Present only in URL mode; in HTML mode it is omitted from the response entirely.
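
Because primaryContent and most of its fields can be null, client code usually needs a few guards before reading values. The following is a minimal sketch of that handling. It takes the response data object as an already-decoded dict; the function name and the keys of the returned summary are illustrative, not part of the API.

```python
from typing import Any, Dict


def read_response(data: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a few values from the response data object described above."""
    # primaryContent is null when nothing could be extracted.
    content = data.get("primaryContent") or {}
    # text is itself nullable, so guard before reading simplifiedHtml.
    text = content.get("text") or {}
    return {
        "has_content": bool(data.get("hasPrimaryContent")),
        "title": content.get("title"),
        "author": content.get("author"),
        "html": text.get("simplifiedHtml"),
        # scrape is present only in URL mode, so its absence implies HTML mode.
        "mode": "url" if "scrape" in data else "html",
    }
```

The summary always has the same keys, so downstream code can check `has_content` first and treat any remaining `None` values as "not found on this page".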

Consumability

Consumability answers the question: does this page contain self-contained, meaningful content that a reader could consume?

Consumable examples: news articles, blog posts, product pages, event listings, documentation pages, recipe pages.

Not consumable examples: homepages with only navigation links, search results pages, error pages, login forms, bot detection pages, index pages.

The reason field provides a natural language explanation of the assessment:

"Page contains a full news article with headline, byline, and body text."
"Page is a 404 error with no consumable article content."
"No primary textual content found on the provided page copy."
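
One way to use the assessment is to filter a batch of responses and keep the reasons for anything rejected. The sketch below assumes each item is a decoded response data object; the function name is hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_by_consumability(
    results: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Return (consumable results, reasons for the rejected ones)."""
    kept: List[Dict[str, Any]] = []
    rejected_reasons: List[str] = []
    for data in results:
        verdict = data.get("consumability") or {}
        if verdict.get("isConsumable"):
            kept.append(data)
        else:
            # Keep the natural-language reason for logging or review.
            rejected_reasons.append(verdict.get("reason", ""))
    return kept, rejected_reasons
```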

Access Restrictions

Parse detects when content is blocked or restricted. The primaryContent.accessRestrictionType field returns an array of restriction types when detected, or null when no restrictions are found.

Restriction types:

Type	Description
`subscription-required`	Content behind a paywall
`bot-detected`	Page served a bot detection challenge
`captcha`	CAPTCHA presented instead of content
`adblock-detected`	Page blocked content due to ad blocker detection
`login-required`	Content requires authentication
`geo`	Content restricted by geographic location
`other`	Other restriction not covered above
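
Since accessRestrictionType is either an array or null, it can help to normalize it to a list before branching on it. The sketch below does that; the `NEEDS_CREDENTIALS` grouping is only an illustrative way to act on the values and is not something the API defines.

```python
from typing import Any, Dict, List, Optional

# Restriction types listed in the table above.
KNOWN_RESTRICTIONS = {
    "subscription-required", "bot-detected", "captcha",
    "adblock-detected", "login-required", "geo", "other",
}

# Hypothetical grouping for illustration; not part of the API.
NEEDS_CREDENTIALS = {"subscription-required", "login-required"}


def restrictions(primary_content: Optional[Dict[str, Any]]) -> List[str]:
    """Return detected restriction types, treating null as an empty list."""
    if not primary_content:
        return []
    return list(primary_content.get("accessRestrictionType") or [])


def needs_credentials(primary_content: Optional[Dict[str, Any]]) -> bool:
    """True if any detected restriction suggests a paywall or login wall."""
    return any(r in NEEDS_CREDENTIALS for r in restrictions(primary_content))
```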

Content Types

Parse handles a range of page types:

Text articles -- News stories, blog posts, documentation. Extracts title, author, body text, and metadata.
Video pages -- Extracts video URL and duration alongside any surrounding text content.
Image-heavy pages -- Extracts the primary image with caption and credit information.
Syndicated content -- Detects republished or syndicated articles and provides original source information via originallyPublished.
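
A response does not carry an explicit "content type" label on this page, so a client that wants one has to infer it from which fields are populated. The sketch below is one such heuristic; the labels it returns are made up for this example.

```python
from typing import Any, Dict, List


def content_kinds(primary_content: Dict[str, Any]) -> List[str]:
    """List which kinds of content are populated on a primaryContent object."""
    checks = [
        ("text", "text"),
        ("video", "video"),
        ("image", "primaryImage"),
        ("syndicated", "originallyPublished"),
    ]
    # A kind counts as present when its field is neither missing nor null.
    return [label for label, field in checks if primary_content.get(field)]
```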

Idempotency

Parse supports an optional jobId parameter for deduplication:

jobId uniqueness is scoped per organization -- different orgs can reuse the same jobId.
Submitting the same jobId within the same org reconnects to the existing workflow result rather than starting a new parse.
Without a jobId, a random UUID is generated for each request.
If a previous job with the same jobId failed, Parse returns an error asking you to retry with a new jobId.
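
To benefit from deduplication, the same input needs to map to the same jobId. One possible approach, sketched below, is to hash the URL and append an attempt counter when a failed job forces a new jobId. The allowed character set for jobId is not documented here, so the use of hex digits and a hyphen is an assumption; the result stays well under the 256-character limit.

```python
import hashlib


def job_id_for(url: str, attempt: int = 0) -> str:
    """Derive a repeatable jobId from a URL; bump attempt after a failed job."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()  # 64 hex characters
    # A failed jobId cannot be reused, so later attempts get a distinct suffix.
    return digest if attempt == 0 else f"{digest}-{attempt}"
```

Retrying a timed-out client call with the same `attempt` reconnects to the existing result, while incrementing `attempt` starts a fresh parse.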

Processing Time

Processing times are approximate and vary based on page complexity, server load, and network conditions.

Scenario	Approximate Time
Simple pages (e.g., example.com)	~10-20s
Standard articles	~20-45s
Heavy JS-rendered pages	~30-60s
Maximum timeout	10 minutes
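
Because the call is synchronous and can run for up to 10 minutes, HTTP clients with short default timeouts may give up before Parse responds. The sketch below sets a client timeout slightly above the maximum; the 30-second margin is an arbitrary choice for illustration.

```python
import urllib.request

WORKFLOW_TIMEOUT_S = 10 * 60  # maximum timeout from the table above
CLIENT_TIMEOUT_S = WORKFLOW_TIMEOUT_S + 30  # hypothetical safety margin


def send(req: urllib.request.Request) -> bytes:
    """Send a prepared request and return the raw response body."""
    # urlopen's timeout applies to blocking socket operations such as
    # connecting and waiting for response data.
    with urllib.request.urlopen(req, timeout=CLIENT_TIMEOUT_S) as resp:
        return resp.read()
```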

Limits

Limit	Value
Max HTML size	2MB
Max request body size	3MB
Max title length	1,000 characters
Max jobId length	256 characters
Workflow timeout	10 minutes
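
Checking these limits on the client before sending can save a round trip. The sketch below covers the HTML, request body, and jobId limits. This page does not say whether MB means 10^6 or 2^20 bytes, or how HTML size is measured, so the code assumes 10^6 bytes (the stricter reading) and measures the UTF-8 encoding.

```python
from typing import List, Optional

# Assumption: MB means 10**6 bytes (the stricter reading); the source does not say.
MAX_HTML_BYTES = 2 * 10**6
MAX_BODY_BYTES = 3 * 10**6
MAX_JOB_ID_CHARS = 256


def limit_violations(body: bytes, html: Optional[str] = None,
                     job_id: Optional[str] = None) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems: List[str] = []
    if len(body) > MAX_BODY_BYTES:
        problems.append("request body exceeds max size")
    # Assumption: HTML size is measured on its UTF-8 encoding.
    if html is not None and len(html.encode("utf-8")) > MAX_HTML_BYTES:
        problems.append("html exceeds max size")
    if job_id is not None and len(job_id) > MAX_JOB_ID_CHARS:
        problems.append("jobId exceeds max length")
    return problems
```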

Next Steps

Quickstart: Get parsing working in under 2 minutes
API Reference: Complete endpoint documentation
