Contents API Overview

View as Markdown

What is the Contents API?

The Contents API extracts clean HTML or Markdown content from any URL you provide. Pass it a list of URLs and get back the full page content, ready for LLM consumption—no parsing, no HTML noise, no browser automation required.

TL;DR: Give it URLs, get back clean content. Works on any publicly accessible webpage.


How it’s different from livecrawl

The Contents API and the livecrawl parameter in the Search API both extract full page content, but they serve different workflows:

Contents APISearch API + livecrawl
Starting pointYou already know the URLsYou have a query, not URLs
Use caseFetch known pages on demandEnrich search results with full content
URL sourceYou provide themYou.com search discovers them
Batch size10 URLs per requestUp to 100 results per search

Use the Contents API when you have a list of specific URLs you want to read. Use livecrawl when you want full content returned alongside search results.


What you get

Each URL in your request returns a structured object:

1[
2 {
3 "url": "https://competitor.com/pricing",
4 "title": "Pricing — Competitor Inc.",
5 "markdown": "# Pricing\n\n## Starter Plan\n$49/month...",
6 "html": "<html>...</html>",
7 "metadata": {
8 "site_name": "Competitor Inc.",
9 "favicon_url": "https://ydc-index.io/favicon?domain=competitor.com&size=128"
10 }
11 }
12]

You control which formats are returned via the formats parameter—request markdown, html, and/or metadata in any combination.


Key features

Any URL, on demand

Pass up to 10 URLs in a single request. The API crawls each one in parallel and returns the content. No need to manage a headless browser or deal with raw HTML yourself.

LLM-ready Markdown

The markdown format strips navigation menus, ads, footers, and other boilerplate. You get actual content of the page—ready to drop into a prompt.

Configurable timeout

Use crawl_timeout (1–60 seconds) to balance speed vs. completeness. For fast pages: 5–10 seconds. For heavy JavaScript-rendered pages: 20–30 seconds.

Metadata extraction

Request metadata alongside content to get the page’s site name and favicon URL—useful for building UIs that display source attribution.


Quickstart

1import os
2import requests
3
4API_KEY = os.environ["YOU_API_KEY"]
5
6response = requests.post(
7 "https://ydc-index.io/v1/contents",
8 headers={"X-API-Key": API_KEY},
9 json={
10 "urls": ["https://you.com/about"],
11 "formats": ["markdown"],
12 },
13)
14
15pages = response.json()
16for page in pages:
17 print(f"Title: {page['title']}")
18 print(f"Content preview: {page['markdown'][:300]}\n")

Parameters

ParameterTypeRequiredDescription
urlsarray of stringsYesThe URLs to fetch content from
formatsarray of stringsNoContent formats to return: markdown, html, metadata (default: markdown)
crawl_timeoutnumberNoPer-URL timeout in seconds, between 1 and 60 (default: 10)

View full API reference


Common use cases

Competitive intelligence

Monitor competitor pricing, feature, or blog pages. Fetch the content on a schedule, feed it to an LLM, and surface meaningful changes—without manual checking.

1import requests
2
3API_KEY = "YOUR_API_KEY"
4
5competitor_pages = [
6 "https://competitor-a.com/pricing",
7 "https://competitor-b.com/pricing",
8 "https://competitor-c.com/features",
9]
10
11response = requests.post(
12 "https://ydc-index.io/v1/contents",
13 headers={"X-API-Key": API_KEY},
14 json={
15 "urls": competitor_pages,
16 "formats": ["markdown"],
17 "crawl_timeout": 15,
18 },
19)
20
21pages = response.json()
22for page in pages:
23 print(f"\n--- {page['title']} ---")
24 # Feed page['markdown'] into your LLM for summarization or diff
25 print(page["markdown"][:500])

Knowledge base ingestion

You have a list of authoritative sources—documentation pages, whitepapers, internal wikis. Fetch them all, convert to clean Markdown, and index into your vector store.

1import requests
2
3API_KEY = "YOUR_API_KEY"
4
5# Known authoritative sources to index
6source_urls = [
7 "https://docs.example.com/api-reference",
8 "https://docs.example.com/authentication",
9 "https://docs.example.com/rate-limits",
10]
11
12response = requests.post(
13 "https://ydc-index.io/v1/contents",
14 headers={"X-API-Key": API_KEY},
15 json={
16 "urls": source_urls,
17 "formats": ["markdown", "metadata"],
18 },
19)
20
21documents = []
22for page in response.json():
23 documents.append({
24 "source": page["url"],
25 "title": page["title"],
26 "content": page["markdown"],
27 })
28 # Index document into your vector store here

Research assistant

Give users the ability to ask questions about specific URLs. Fetch the page content on the fly and feed it as context into your LLM—turning any URL into a searchable document.

1import requests
2
3API_KEY = "YOUR_API_KEY"
4
5def fetch_url_context(url: str) -> str:
6 response = requests.post(
7 "https://ydc-index.io/v1/contents",
8 headers={"X-API-Key": API_KEY},
9 json={"urls": [url], "formats": ["markdown"]},
10 )
11 pages = response.json()
12 return pages[0]["markdown"] if pages else ""
13
14# User asks: "Summarize this page for me"
15url = "https://example.com/long-report"
16context = fetch_url_context(url)
17
18prompt = f"Summarize the following page content:\n\n{context}"
19# Pass prompt to your LLM

Best practices

Request only the formats you need

Each format adds processing time. If you only need Markdown for LLM consumption, don’t request html. If you don’t need site metadata for your UI, skip metadata.

Batch your URLs

A single request with 10 URLs is faster than 10 separate requests. The API processes them in parallel.

Set crawl_timeout based on the target site

For simple static pages, 5–10 seconds is usually enough. For JavaScript-heavy pages (SPAs, dashboards), increase to 20–30 seconds to give the renderer time to complete.

Handle partial failures gracefully

If one URL in a batch fails to crawl (e.g., it’s behind a login wall or returns a 404), the API returns null for its markdown and html fields. Always check before processing:

1for page in response.json():
2 if page.get("markdown"):
3 # Process content
4 pass
5 else:
6 print(f"Failed to fetch: {page['url']}")

Rate limits & pricing

Pricing is based on the number of URLs fetched per request. For detailed pricing information, visit you.com/platform/upgrade or contact api@you.com.


Next steps