Retrieve page content

Live-crawl search results to get full HTML or Markdown page content. Ideal for RAG, knowledge base construction, and deep content analysis.

Overview

By default, search results include snippets — 100–200 words of extracted text per result. Enable live crawling to get the full page content: typically 2,000–10,000 words of clean HTML or Markdown per result.

Full page content enables:

  • Deep RAG with full document context
  • Knowledge base construction from live web data
  • Comprehensive content synthesis across sources
  • Full article bodies for news results
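As an illustration of the RAG use case, crawled results can be collapsed into a document list for an embedding pipeline. This sketch stubs the result shape (title, url, contents.markdown, mirroring the fields used in the examples on this page) rather than calling the API, so the stub classes are illustrative, not SDK types:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the SDK's result objects.
@dataclass
class Contents:
    markdown: str

@dataclass
class WebResult:
    title: str
    url: str
    contents: Optional[Contents] = None

def to_documents(results):
    """Turn crawled search results into RAG-ready documents,
    skipping results where the live crawl returned no content."""
    docs = []
    for r in results:
        if r.contents and r.contents.markdown:
            docs.append({"id": r.url, "title": r.title, "text": r.contents.markdown})
    return docs

results = [
    WebResult("Attention Is All You Need", "https://example.com/a",
              Contents("## Transformer architecture...")),
    WebResult("Uncrawled page", "https://example.com/b"),  # crawl returned nothing
]
print(len(to_documents(results)))  # only successfully crawled results count
```

The same shape works for news results; only the field names assumed here need to match your SDK version.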

How it works

Add livecrawl to any search request. The API fetches each matching result’s page in real time and attaches a contents object to it. You choose which result types to crawl and what format to return.

Parameter           Type      Options              Description
livecrawl           string    web, news, all       Which result types to crawl
livecrawl_formats   string    html, markdown       Format for returned content
crawl_timeout       integer   1–60 (default 10)    Max seconds to wait per page

markdown is recommended for LLM use cases — it strips navigation, ads, and boilerplate HTML, leaving only the core content.

Crawl web results

Set livecrawl=web to attach full page content to web results. The contents.markdown (or contents.html) field is added to each result that was successfully crawled.

from youdotcom import You
from youdotcom.models import LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    res = you.search.unified(
        query="transformer architecture explained",
        count=5,
        livecrawl=LiveCrawl.WEB,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
    )

    if res.results and res.results.web:
        for result in res.results.web:
            print(f"{result.title}")
            print(f"  URL: {result.url}")
            if result.contents:
                print(f"  Content ({len(result.contents.markdown)} chars)")
                print(f"  Preview: {result.contents.markdown[:200]}...\n")
            else:
                print("  (No content retrieved)\n")

Crawl news results

Set livecrawl=news to get full article bodies for news results. Combine with freshness for breaking news pipelines.

from youdotcom import You
from youdotcom.models import Freshness, LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    res = you.search.unified(
        query="semiconductor supply chain",
        freshness=Freshness.WEEK,
        count=5,
        livecrawl=LiveCrawl.NEWS,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
    )

    if res.results and res.results.news:
        for article in res.results.news:
            print(f"{article.title}")
            if article.contents:
                print(article.contents.markdown[:400])
                print()

Crawl both web and news

Use livecrawl=all to crawl every result type in one request.

from youdotcom import You
from youdotcom.models import LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    res = you.search.unified(
        query="quantum computing breakthroughs",
        count=5,
        livecrawl=LiveCrawl.ALL,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
    )

    if res.results:
        for result in (res.results.web or []):
            if result.contents:
                print(f"[WEB] {result.title}: {len(result.contents.markdown)} chars")

        for result in (res.results.news or []):
            if result.contents:
                print(f"[NEWS] {result.title}: {len(result.contents.markdown)} chars")

Control crawl timeout

By default the crawler waits up to 10 seconds per page. For latency-sensitive applications, reduce crawl_timeout. For complex or slow-loading pages, increase it (up to 60 seconds).

from youdotcom import You
from youdotcom.models import LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    # Low-latency pipeline: only wait 3 seconds per page
    res = you.search.unified(
        query="latest Python releases",
        count=5,
        livecrawl=LiveCrawl.WEB,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
        crawl_timeout=3,
    )

    if res.results and res.results.web:
        for result in res.results.web:
            status = "crawled" if result.contents else "skipped (timeout)"
            print(f"{result.title} — {status}")
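With a short timeout, some pages will be skipped; one pattern is to fall back to the snippet the search already returned. The sketch below uses stub objects rather than SDK types, and the `snippet` attribute name is an assumption for illustration (the Overview notes results include snippets, but the exact field name may differ in your SDK version):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the SDK's result objects.
@dataclass
class Contents:
    markdown: str

@dataclass
class WebResult:
    title: str
    snippet: str                       # field name assumed for illustration
    contents: Optional[Contents] = None

def best_text(result) -> str:
    """Prefer full crawled content; fall back to the search snippet
    when the crawl timed out or returned nothing."""
    if result.contents and result.contents.markdown:
        return result.contents.markdown
    return result.snippet

crawled = WebResult("A", "short snippet", Contents("full markdown body"))
skipped = WebResult("B", "short snippet")   # crawl timed out
print(best_text(crawled))  # full markdown body
print(best_text(skipped))  # short snippet
```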

HTML vs Markdown

Format     Best for
markdown   LLM prompts, RAG, text analysis: clean, no boilerplate
html       Rendering, scraping structured data, preserving page layout
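If you request html (e.g. to preserve layout) but later want plain text for analysis, a minimal sketch using Python's standard-library parser, independent of the SDK:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.parts)

print(html_to_text("<p>Hello <b>world</b></p><script>x()</script>"))  # Hello world
```

For LLM pipelines, requesting markdown directly is usually simpler than post-processing html like this.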

Already have URLs?

If you have a list of URLs and don’t need to search first, use the Contents API directly. It accepts URLs without a query and returns the same markdown or html content.


Next steps