Overview

Parse is a synchronous API that extracts structured content from web pages. Give it a URL or raw HTML, and it returns structured data -- title, author, article text, images, and metadata -- in a single request. No polling, no webhooks. One call and you have your data.

What Parse Does

Send a POST /parse request with either a URL or raw HTML, and Parse returns structured content extracted by AI. The response includes the page's primary content, a consumability assessment, and, in URL mode, HTTP metadata from the scrape. The call is synchronous: the response to your request already contains the extracted data.

How It Works

Parse operates in two modes:

  • URL mode: Provide a url. Parse loads the page in a headless browser, then runs AI extraction on the rendered content.
  • HTML mode: Provide raw HTML in the html field. Parse skips the browser scrape entirely and runs AI extraction directly on the provided markup.
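The two modes can be sketched as a small request builder. This is a minimal illustration, not the official client: the base URL is a placeholder, the JSON body encoding and the "exactly one of url or html" check are assumptions, and authentication is left out because this page does not describe it. Only the url and html field names and the POST /parse path come from the text above.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical base URL; this page only names the POST /parse path.
BASE_URL = "https://api.example.com"


def build_parse_request(
    url: Optional[str] = None, html: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST /parse request for URL mode or HTML mode."""
    # Assumption: the two modes are mutually exclusive.
    if (url is None) == (html is None):
        raise ValueError("provide exactly one of url or html")
    payload = {"url": url} if url is not None else {"html": html}
    # Assumption: the endpoint accepts a JSON request body.
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/parse",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


url_request = build_parse_request(url="https://example.com/article")
html_request = build_parse_request(html="<article><h1>Hello</h1></article>")
```

Building the request separately from sending it keeps the mode check easy to test without touching the network.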

The Extraction Pipeline

Regardless of mode, content flows through three stages:

  1. Content acquisition -- For URL mode, a headless browser loads and renders the page. For HTML mode, the provided markup is used as-is, with no browser step.
  2. AI-powered structured extraction -- An LLM with structured output analyzes the content, extracting title, author, body text, images, video, and metadata.
  3. Consumability evaluation -- The extracted content is assessed to determine whether the page contains meaningful, self-contained content.

Understanding the Response

The response data object contains up to four top-level fields (scrape appears only in URL mode):

hasPrimaryContent (boolean) -- Quick check for whether the page had extractable content. Useful for filtering before inspecting the full response.

consumability (object) -- Contains isConsumable (boolean) and reason (string) explaining the assessment. See the Consumability section below.

primaryContent (object | null) -- The extracted content. Null when no content could be extracted. Contains:

Field                  Type             Description
title                  string | null    Page or article title
description            string | null    Summary or meta description
author                 string | null    Content author
publisher              string | null    Publishing organization
publishedAt            string | null    Original publication date
updatedAt              string | null    Last update date
isSponsored            boolean | null   Whether the content is sponsored
isDigest               boolean | null   Whether the content is a digest or roundup
accessRestrictionType  string[] | null  Detected access restrictions
text                   object | null    Body content with simplifiedHtml
video                  object | null    Video URL and duration if present
primaryImage           object | null    Primary image with caption and credit
originallyPublished    object | null    Original source info for syndicated content

Every field is nullable, since a given page may not supply it. For example, an article with no byline returns author as null.

scrape (object) -- Contains the HTTP status code from the browser scrape. Present only in URL mode; in HTML mode the field is omitted from the response entirely.
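Because primaryContent and its fields can be null, and scrape can be missing altogether, client code needs to read the data object defensively. The sketch below assumes the data object has already been decoded into a Python dict; the summarize helper and its output keys are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def summarize(data: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a Parse data object into a few fields, tolerating nulls."""
    # primaryContent is null when nothing could be extracted.
    content = data.get("primaryContent") or {}
    # text is itself nullable; simplifiedHtml lives inside it.
    text = content.get("text") or {}
    return {
        "has_content": bool(data.get("hasPrimaryContent")),
        "title": content.get("title"),
        "author": content.get("author"),
        "simplified_html": text.get("simplifiedHtml"),
        # scrape is omitted in HTML mode, so test for the key itself.
        "mode": "url" if "scrape" in data else "html",
    }
```

Testing for the scrape key rather than reading a value from it avoids guessing at field names inside the scrape object, which this page does not list.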

Consumability

Consumability answers the question: does this page contain self-contained, meaningful content that a reader could consume?

Consumable examples: news articles, blog posts, product pages, event listings, documentation pages, recipe pages.

Not consumable examples: homepages with only navigation links, search results pages, error pages, login forms, bot detection pages, index pages.

The reason field provides a natural language explanation of the assessment:

  • "Page contains a full news article with headline, byline, and body text."
  • "Page is a 404 error with no consumable article content."
  • "No primary textual content found on the provided page copy."
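One way to use the assessment is to split a batch of results into pages worth keeping and pages to skip, carrying the reason along for logging. This is a sketch under the assumption that each item is a decoded data object; split_by_consumability is a hypothetical helper, and requiring both hasPrimaryContent and isConsumable is a policy choice rather than something the API mandates.

```python
from typing import Any, Dict, List, Tuple


def split_by_consumability(
    results: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Return (kept data objects, reasons for the ones that were skipped)."""
    kept: List[Dict[str, Any]] = []
    skipped_reasons: List[str] = []
    for data in results:
        assessment = data.get("consumability") or {}
        if data.get("hasPrimaryContent") and assessment.get("isConsumable"):
            kept.append(data)
        else:
            # reason is a natural-language string, suitable for logs.
            skipped_reasons.append(assessment.get("reason", ""))
    return kept, skipped_reasons
```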

Access Restrictions

Parse detects when content is blocked or restricted. The primaryContent.accessRestrictionType field returns an array of restriction types when detected, or null when no restrictions are found.

Restriction types:

Type                   Description
subscription-required  Content behind a paywall
bot-detected           Page served a bot detection challenge
captcha                CAPTCHA presented instead of content
adblock-detected       Page blocked content due to ad blocker detection
login-required         Content requires authentication
geo                    Content restricted by geographic location
other                  Other restriction not covered above
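Since accessRestrictionType is either an array or null, it helps to normalize it to a list before branching on it. The sketch below assumes a decoded data object; both helper names are hypothetical. The set of known values is copied from the table above, and flagging anything outside it is a defensive assumption in case new types are added later.

```python
from typing import Any, Dict, List

# Restriction types listed in the table above.
KNOWN_RESTRICTIONS = frozenset({
    "subscription-required",
    "bot-detected",
    "captcha",
    "adblock-detected",
    "login-required",
    "geo",
    "other",
})


def access_restrictions(data: Dict[str, Any]) -> List[str]:
    """Return detected restriction types, treating null as an empty list."""
    content = data.get("primaryContent") or {}
    return list(content.get("accessRestrictionType") or [])


def unrecognized_restrictions(data: Dict[str, Any]) -> List[str]:
    """Return any restriction types that are not in the documented table."""
    return [r for r in access_restrictions(data) if r not in KNOWN_RESTRICTIONS]
```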

Content Types

Parse handles a range of page types:

  • Text articles -- News stories, blog posts, documentation. Extracts title, author, body text, and metadata.
  • Video pages -- Extracts video URL and duration alongside any surrounding text content.
  • Image-heavy pages -- Extracts the primary image with caption and credit information.
  • Syndicated content -- Detects republished or syndicated articles and provides original source information via originallyPublished.

Idempotency

Parse supports an optional jobId parameter for deduplication:

  • jobId uniqueness is scoped per organization -- different orgs can reuse the same jobId.
  • Submitting the same jobId within the same org reconnects to the existing workflow result rather than starting a new parse.
  • Without a jobId, a random UUID is generated for each request.
  • If a previous job with the same jobId failed, Parse returns an error asking you to retry with a new jobId.
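To make retries of the same logical request reuse one jobId, the ID can be derived from the input instead of generated randomly. The following is one possible scheme, not a required format: the job_id_for helper, the "parse" prefix, and the attempt counter are all hypothetical, and it assumes jobId accepts an arbitrary string within the 256-character limit listed under Limits.

```python
import hashlib

MAX_JOB_ID_LENGTH = 256  # from the Limits table


def job_id_for(url: str, attempt: int = 1, prefix: str = "parse") -> str:
    """Derive a deterministic jobId from a URL and an attempt number."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()  # 64 hex chars
    job_id = f"{prefix}-{digest}-{attempt}"
    if len(job_id) > MAX_JOB_ID_LENGTH:
        raise ValueError("jobId exceeds the 256-character limit")
    return job_id


first = job_id_for("https://example.com/article")
# After a failed job, bump the attempt to obtain a fresh jobId.
retry = job_id_for("https://example.com/article", attempt=2)
```

Resubmitting with the same attempt number reconnects to the existing result, while incrementing it after a failure gives the new jobId that the error response asks for.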

Processing Time

Processing times are approximate and vary based on page complexity, server load, and network conditions.

Scenario                          Approximate Time
Simple pages (e.g., example.com)  ~10-20s
Standard articles                 ~20-45s
Heavy JS-rendered pages           ~30-60s
Maximum timeout                   10 minutes
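Because the call is synchronous and can run for up to 10 minutes, an HTTP client with a short default timeout may give up before the response arrives. A minimal sketch using Python's standard library follows; the 30-second margin and the helper names are assumptions, and the response is assumed to be JSON.

```python
import json
import urllib.request

WORKFLOW_TIMEOUT_S = 10 * 60  # "Maximum timeout" from the table above


def client_timeout(margin_s: float = 30.0) -> float:
    """Client-side timeout that outlasts the server-side maximum."""
    if margin_s < 0:
        raise ValueError("margin_s must be non-negative")
    return WORKFLOW_TIMEOUT_S + margin_s


def send(request: urllib.request.Request, margin_s: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    # urlopen's timeout applies to blocking socket operations such as
    # connecting and reading, so it must cover the wait for the response.
    with urllib.request.urlopen(request, timeout=client_timeout(margin_s)) as resp:
        return json.load(resp)
```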

Limits

Limit                  Value
Max HTML size          2 MB
Max request body size  3 MB
Max title length       1,000 characters
Max jobId length       256 characters
Workflow timeout       10 minutes
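These limits can be checked on the client before sending, which avoids a round trip for requests that would be rejected. The sketch below rests on assumptions: the page does not say whether MB means 10^6 or 2^20 bytes, so the stricter decimal reading is used, sizes are measured as UTF-8 bytes, and the body size is that of a JSON encoding produced by json.dumps. The check_limits helper is hypothetical.

```python
import json
from typing import Any, Dict, List

MB = 1000 * 1000  # assumption: decimal megabytes, the stricter reading
MAX_HTML_BYTES = 2 * MB
MAX_BODY_BYTES = 3 * MB
MAX_JOB_ID_LENGTH = 256


def check_limits(payload: Dict[str, Any]) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems: List[str] = []
    html = payload.get("html")
    if html is not None and len(html.encode("utf-8")) > MAX_HTML_BYTES:
        problems.append("html exceeds 2 MB")
    # Size of the whole request body as json.dumps would serialize it.
    if len(json.dumps(payload).encode("utf-8")) > MAX_BODY_BYTES:
        problems.append("request body exceeds 3 MB")
    job_id = payload.get("jobId")
    if job_id is not None and len(job_id) > MAX_JOB_ID_LENGTH:
        problems.append("jobId exceeds 256 characters")
    return problems
```

The title length limit is not checked here because this page does not say which request field, if any, it applies to.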
