Contents API Overview

What is the Contents API?

The Contents API extracts clean HTML or Markdown content from URLs you provide. Pass it a list of URLs and get back the full page content for each, ready for LLM consumption: no parsing, no HTML noise, no browser automation required.


How it’s different from livecrawl

The Contents API and the livecrawl parameter in the Search API both extract full page content, but they serve different workflows:

| | Contents API | Search API + livecrawl |
| --- | --- | --- |
| Starting point | You already know the URLs | You have a query, not URLs |
| Use case | Fetch known pages on demand | Enrich search results with full content |
| URL source | You provide them | You.com search discovers them |
| Batch size | 10 URLs per request | Up to 100 results per search |

Use the Contents API when you have a list of specific URLs you want to read. Use livecrawl when you want full content returned alongside search results.


What you get

Each URL in your request returns a structured object:

    [
      {
        "url": "https://competitor.com/pricing",
        "title": "Pricing — Competitor Inc.",
        "markdown": "# Pricing\n\n## Starter Plan\n$49/month...",
        "html": "<html>...</html>",
        "metadata": {
          "site_name": "Competitor Inc.",
          "favicon_url": "https://ydc-index.io/favicon?domain=competitor.com&size=128"
        }
      }
    ]

You control which formats are returned via the formats parameter—request markdown, html, and/or metadata in any combination.
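
As a rough sketch of how the response shape above might be consumed without the SDK, the snippet below parses a JSON payload into a small dataclass. The `Page` class and `parse_pages` helper are hypothetical names used only for illustration, and the code assumes that fields for formats you did not request are either absent or null.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical container mirroring the response object shown above."""
    url: str
    title: Optional[str] = None
    markdown: Optional[str] = None
    html: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def parse_pages(raw_json: str) -> List[Page]:
    """Turn a JSON array of page objects into Page instances.

    Assumption: formats that were not requested are missing or null.
    """
    pages = []
    for item in json.loads(raw_json):
        pages.append(
            Page(
                url=item["url"],
                title=item.get("title"),
                markdown=item.get("markdown"),
                html=item.get("html"),
                metadata=item.get("metadata") or {},
            )
        )
    return pages


# Sample payload requesting markdown and metadata only (no html).
sample = """
[
  {
    "url": "https://competitor.com/pricing",
    "title": "Pricing - Competitor Inc.",
    "markdown": "# Pricing\\n\\n## Starter Plan\\n$49/month...",
    "metadata": {"site_name": "Competitor Inc."}
  }
]
"""

pages = parse_pages(sample)
print(pages[0].title, "|", pages[0].metadata.get("site_name"))
```

Because `html` was not part of the sample payload, `pages[0].html` stays `None`, so downstream code can test each field before using it.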


Key features

Any URL, on demand

Pass up to 10 URLs in a single request. The API crawls them all in parallel and returns the content. No need to manage a headless browser or deal with raw HTML yourself.

LLM-ready Markdown

The markdown format strips navigation menus, ads, footers, and other boilerplate. You get the actual content of the page, ready to drop into a prompt.

Configurable timeout

Use crawl_timeout (1–60 seconds) to balance speed vs. completeness. For fast pages: 5–10 seconds. For heavy JavaScript-rendered pages: 20–30 seconds.

Metadata extraction

Request metadata alongside content to get the page’s site name and favicon URL—useful for building UIs that display source attribution.
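
One way source attribution could be assembled from the metadata fields is sketched below. `format_attribution` is a hypothetical helper, not part of the SDK, and the fallback to the URL's hostname when no site name is present is an assumption about what a UI might want.

```python
from typing import Optional
from urllib.parse import urlparse


def format_attribution(url: str, metadata: Optional[dict]) -> dict:
    """Hypothetical helper: build a small source-attribution record for a UI.

    Falls back to the URL's hostname when no site name is available.
    """
    metadata = metadata or {}
    label = metadata.get("site_name") or urlparse(url).netloc
    return {
        "label": label,
        "href": url,
        "icon": metadata.get("favicon_url"),  # None when no favicon was returned
    }


card = format_attribution(
    "https://competitor.com/pricing",
    {
        "site_name": "Competitor Inc.",
        "favicon_url": "https://ydc-index.io/favicon?domain=competitor.com&size=128",
    },
)
print(card["label"])
```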


Quickstart

    from youdotcom import You
    from youdotcom.models import ContentsFormats

    # Replace "api_key" with your You.com API key.
    with You(api_key_auth="api_key") as you:
        pages = you.contents.generate(
            urls=["https://you.com/about"],
            formats=[ContentsFormats.MARKDOWN],
        )

        for page in pages:
            print(f"Title: {page.title}")
            # markdown can be null if the crawl failed, so fall back to "".
            print(f"Content preview: {(page.markdown or '')[:300]}\n")

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| urls | array of strings | Yes | The URLs to fetch content from |
| formats | array of strings | No | Content formats to return: markdown, html, metadata (default: markdown) |
| crawl_timeout | number | No | Per-URL timeout in seconds, between 1 and 60 (default: 10) |
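
The limits in the table can be checked client-side before a request is sent. The sketch below is a hypothetical `build_contents_payload` helper. The dictionary it returns simply reuses the parameter names from the table, and treating that as the request body is an assumption rather than a documented wire format. The 10-URL cap comes from the batch limit described earlier on this page.

```python
from typing import List, Optional

ALLOWED_FORMATS = {"markdown", "html", "metadata"}
MAX_URLS_PER_REQUEST = 10  # documented batch limit


def build_contents_payload(
    urls: List[str],
    formats: Optional[List[str]] = None,
    crawl_timeout: float = 10,
) -> dict:
    """Hypothetical helper: validate arguments against the documented limits."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if len(urls) > MAX_URLS_PER_REQUEST:
        raise ValueError(f"at most {MAX_URLS_PER_REQUEST} URLs per request")

    formats = formats or ["markdown"]  # documented default
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")

    if not 1 <= crawl_timeout <= 60:
        raise ValueError("crawl_timeout must be between 1 and 60 seconds")

    return {
        "urls": list(urls),
        "formats": list(formats),
        "crawl_timeout": crawl_timeout,
    }


payload = build_contents_payload(["https://you.com/about"], crawl_timeout=15)
print(payload)
```

Failing fast on an oversized batch or an out-of-range timeout keeps the error close to the calling code instead of surfacing later as a rejected request.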

View full API reference


Common use cases

Competitive intelligence

Monitor competitor pricing, feature, or blog pages. Fetch the content on a schedule, feed it to an LLM, and surface meaningful changes—without manual checking.

    from youdotcom import You
    from youdotcom.models import ContentsFormats

    competitor_pages = [
        "https://competitor-a.com/pricing",
        "https://competitor-b.com/pricing",
        "https://competitor-c.com/features",
    ]

    with You(api_key_auth="api_key") as you:
        pages = you.contents.generate(
            urls=competitor_pages,
            formats=[ContentsFormats.MARKDOWN],
            crawl_timeout=15,
        )

        for page in pages:
            print(f"\n--- {page.title} ---")
            # Feed page.markdown into your LLM for summarization or diff
            print(page.markdown[:500])
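
To surface only meaningful changes between scheduled runs, one option is to store a fingerprint of each page and compare it on the next fetch. The sketch below assumes you persist the fingerprint map yourself. `fingerprint` and `find_changed` are hypothetical helpers, and hashing whitespace-normalized Markdown is just one possible definition of "changed".

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def fingerprint(markdown: str) -> str:
    """Hash whitespace-normalized Markdown so pure spacing changes are ignored."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_changed(
    fetched: List[Tuple[str, Optional[str]]],
    previous: Dict[str, str],
) -> Tuple[List[str], Dict[str, str]]:
    """Compare fetched (url, markdown) pairs with stored fingerprints.

    Returns the URLs whose content changed and the updated fingerprint map.
    URLs with no content (failed crawls) are skipped and keep their old entry.
    """
    changed = []
    updated = dict(previous)
    for url, markdown in fetched:
        if not markdown:
            continue
        digest = fingerprint(markdown)
        if previous.get(url) != digest:
            changed.append(url)
        updated[url] = digest
    return changed, updated


# First run: nothing is stored yet, so both URLs count as changed.
run1 = [
    ("https://competitor-a.com/pricing", "# Pricing\n$49/month"),
    ("https://competitor-b.com/pricing", "# Pricing\n$99/month"),
]
changed, state = find_changed(run1, {})

# Second run: only competitor A changed its price.
run2 = [
    ("https://competitor-a.com/pricing", "# Pricing\n$59/month"),
    ("https://competitor-b.com/pricing", "# Pricing\n$99/month"),
]
changed, state = find_changed(run2, state)
print(changed)  # ['https://competitor-a.com/pricing']
```

Only the pages reported in `changed` then need to be sent to the LLM for a summary or diff.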

Knowledge base ingestion

You have a list of authoritative sources—documentation pages, whitepapers, internal wikis. Fetch them all, convert to clean Markdown, and index into your vector store.

    from youdotcom import You
    from youdotcom.models import ContentsFormats

    # Known authoritative sources to index
    source_urls = [
        "https://docs.example.com/api-reference",
        "https://docs.example.com/authentication",
        "https://docs.example.com/rate-limits",
    ]

    with You(api_key_auth="api_key") as you:
        pages = you.contents.generate(
            urls=source_urls,
            formats=[ContentsFormats.MARKDOWN, ContentsFormats.METADATA],
        )

        documents = []
        for page in pages:
            documents.append({
                "source": page.url,
                "title": page.title,
                "content": page.markdown,
            })
            # Index document into your vector store here
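
Vector stores usually index smaller passages rather than whole pages, so a chunking step often sits between fetching and indexing. The sketch below splits Markdown at headings. `split_markdown_by_heading` is a hypothetical helper, and heading-based splitting is only one possible strategy. Note that it would also split on a line starting with `# ` inside a code sample.

```python
import re
from typing import Dict, List


def split_markdown_by_heading(markdown: str, source: str, title: str) -> List[Dict[str, str]]:
    """Hypothetical helper: one chunk per heading, with source info on each chunk."""
    # Split just before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        text = section.strip()
        if text:
            chunks.append({"source": source, "title": title, "content": text})
    return chunks


doc = "# Authentication\nUse an API key.\n\n## Rotating keys\nRotate keys regularly."
chunks = split_markdown_by_heading(
    doc, "https://docs.example.com/authentication", "Authentication"
)
print(len(chunks))  # 2
```

Keeping `source` and `title` on every chunk lets retrieval results link back to the page they came from.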

Research assistant

Give users the ability to ask questions about specific URLs. Fetch the page content on the fly and feed it as context into your LLM—turning any URL into a searchable document.

    from youdotcom import You
    from youdotcom.models import ContentsFormats

    def fetch_url_context(url: str) -> str:
        """Return the page's Markdown, or an empty string if the fetch failed."""
        with You(api_key_auth="api_key") as you:
            pages = you.contents.generate(urls=[url], formats=[ContentsFormats.MARKDOWN])
        # markdown can be null for a failed crawl, so fall back to "".
        return (pages[0].markdown or "") if pages else ""

    # User asks: "Summarize this page for me"
    url = "https://example.com/long-report"
    context = fetch_url_context(url)

    prompt = f"Summarize the following page content:\n\n{context}"
    # Pass prompt to your LLM
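
Long pages can exceed a model's context window, so the prompt-building step may need a size cap. The sketch below is a hypothetical `build_prompt` helper. The 12,000-character default is an arbitrary illustrative budget, not a documented limit, and the empty-context message is just one way to handle a failed fetch.

```python
def build_prompt(question: str, context: str, max_context_chars: int = 12000) -> str:
    """Hypothetical helper: combine a question with page content, truncating long pages.

    max_context_chars is an illustrative budget; tune it to your model.
    """
    if not context:
        return f"{question}\n\n(No page content could be fetched.)"
    if len(context) > max_context_chars:
        context = context[:max_context_chars] + "\n\n[content truncated]"
    return f"{question}\n\nPage content:\n\n{context}"


prompt = build_prompt("Summarize this page for me.", "# Report\n\nRevenue grew 12%.")
print(prompt)
```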

Best practices

Request only the formats you need

Each format adds processing time. If you only need Markdown for LLM consumption, don’t request html. If you don’t need site metadata for your UI, skip metadata.

Batch your URLs

A single request with 10 URLs is faster than 10 separate requests. The API processes them in parallel.
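
When you have more than 10 URLs, they need to be split into request-sized groups first. A minimal sketch, with `batch_urls` as a hypothetical helper that also drops duplicate URLs so the same page is not fetched twice:

```python
from typing import Iterator, List

MAX_URLS_PER_REQUEST = 10  # documented per-request limit


def batch_urls(urls: List[str], size: int = MAX_URLS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield de-duplicated URLs, in their original order, in lists of at most `size`."""
    unique = list(dict.fromkeys(urls))  # drops duplicates, preserves order
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


urls = [f"https://docs.example.com/page-{i}" for i in range(23)]
batches = list(batch_urls(urls))
print([len(b) for b in batches])  # [10, 10, 3]
```

Each yielded list can then be passed as the `urls` argument of one request.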

Set crawl_timeout based on the target site

For simple static pages, 5–10 seconds is usually enough. For JavaScript-heavy pages (SPAs, dashboards), increase to 20–30 seconds to give the renderer time to complete.
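
The guidance above could be encoded as a small helper so that callers do not hard-code timeouts. `pick_crawl_timeout` is a hypothetical function, and the specific values of 10 and 25 seconds are illustrative picks from the suggested ranges.

```python
from typing import Optional

MIN_TIMEOUT, MAX_TIMEOUT = 1, 60  # documented bounds, in seconds


def pick_crawl_timeout(js_heavy: bool, override: Optional[float] = None) -> float:
    """Choose a crawl_timeout from the guidance above, clamped to the valid range."""
    if override is not None:
        value = override
    elif js_heavy:
        value = 25  # middle of the suggested 20-30 s range
    else:
        value = 10  # upper end of the suggested 5-10 s range
    return max(MIN_TIMEOUT, min(MAX_TIMEOUT, value))


print(pick_crawl_timeout(js_heavy=False))       # 10
print(pick_crawl_timeout(js_heavy=True))        # 25
print(pick_crawl_timeout(False, override=120))  # 60
```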

Handle partial failures gracefully

If one URL in a batch fails to crawl (e.g., it’s behind a login wall or returns a 404), the API returns null for its markdown and html fields. Always check before processing:

    for page in pages:
        if page.markdown:
            # Process content
            pass
        else:
            print(f"Failed to fetch: {page.url}")
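
Going one step further, the failed URLs can be collected for a later retry instead of only being logged. The sketch below uses `SimpleNamespace` objects as stand-ins for SDK page objects, on the assumption that they expose `url` and `markdown` attributes as in the examples above. `split_results` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def split_results(pages: Iterable) -> Tuple[list, List[str]]:
    """Separate pages with usable Markdown from URLs that came back empty."""
    ok, failed_urls = [], []
    for page in pages:
        if page.markdown:
            ok.append(page)
        else:
            failed_urls.append(page.url)
    return ok, failed_urls


# Stand-in objects with the same attributes as the pages used above.
pages = [
    SimpleNamespace(url="https://example.com/a", markdown="# A"),
    SimpleNamespace(url="https://example.com/login-only", markdown=None),
]
ok, failed_urls = split_results(pages)
print(failed_urls)  # ['https://example.com/login-only']
```

The `failed_urls` list could be retried once with a longer `crawl_timeout`. Pages behind a login wall or returning a 404 will keep failing, so cap the number of retries.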

Rate limits & pricing

Pricing is based on the number of URLs fetched per request. See you.com/pricing or contact api@you.com.


Next steps