---
title: Retrieve page content
subtitle: >-
Live-crawl search results to get full HTML or Markdown page content. Ideal for
RAG, knowledge base construction, and deep content analysis.
'og:title': Retrieve Page Content | You.com Search API
'og:description': >-
Live-crawl search results to get full HTML or Markdown page content. Ideal for
RAG, knowledge base construction, and deep content analysis.
---
## Overview
By default, search results include snippets — 100–200 words of extracted text per result. Enable **live crawling** to get the full page content: typically 2,000–10,000 words of clean HTML or Markdown per result.
Full page content enables:
* Deep RAG with full document context
* Knowledge base construction from live web data
* Comprehensive content synthesis across sources
* Full article bodies for news results
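At 2,000–10,000 words per page, full content can dominate an LLM context window, so it helps to budget before prompting. A rough sketch; the ~1.3 tokens-per-word ratio is a common rule of thumb for English text, not an API guarantee:

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate for a crawled page body (rule of thumb only)."""
    return int(len(text.split()) * tokens_per_word)

# A 5,000-word page lands around 6,500 tokens under this estimate
```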
## How it works
Add `livecrawl` to any search request. The API fetches each matching result's page in real time and attaches a `contents` object to it. You choose which result types to crawl and what format to return.
| Parameter | Type | Options | Description |
| ------------------- | ------- | ----------------------- | ---------------------------- |
| `livecrawl` | string | `web`, `news`, `all` | Which result types to crawl |
| `livecrawl_formats` | string | `html`, `markdown` | Format for returned content |
| `crawl_timeout` | integer | `1`–`60` (default `10`) | Max seconds to wait per page |
`markdown` is recommended for LLM use cases — it strips navigation, ads, and boilerplate HTML, leaving only the core content.
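Once crawled, the markdown bodies can be stitched into a prompt context. A minimal sketch in plain Python: the section header format and per-source character budget are illustrative choices, and the `(url, markdown)` pairs stand in for `result.url` and `result.contents.markdown` from the examples below.

```python
def build_context(pages, max_chars_per_source=4000):
    """Assemble a labelled context block from (url, markdown) pairs.

    Pages that failed to crawl (markdown is None or empty) are skipped,
    and each body is truncated to a fixed per-source budget.
    """
    sections = []
    for url, markdown in pages:
        if markdown:
            sections.append(f"## Source: {url}\n{markdown[:max_chars_per_source]}")
    return "Answer using only the sources below.\n\n" + "\n\n".join(sections)
```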
## Crawl web results
Set `livecrawl=web` to attach full page content to web results. The `contents.markdown` (or `contents.html`) field is added to each result that was successfully crawled.
```python
from youdotcom import You
from youdotcom.models import LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    res = you.search.unified(
        query="transformer architecture explained",
        count=5,
        livecrawl=LiveCrawl.WEB,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
    )

    if res.results and res.results.web:
        for result in res.results.web:
            print(f"{result.title}")
            print(f"  URL: {result.url}")
            if result.contents:
                print(f"  Content ({len(result.contents.markdown)} chars)")
                print(f"  Preview: {result.contents.markdown[:200]}...\n")
            else:
                print("  (No content retrieved)\n")
```
```typescript
import { You } from "@youdotcom-oss/sdk";
import { LiveCrawl, LiveCrawlFormats } from "@youdotcom-oss/sdk/models";
const you = new You({
  apiKeyAuth: "api_key",
});

async function run() {
  const result = await you.search({
    query: "transformer architecture explained",
    count: 5,
    livecrawl: LiveCrawl.Web,
    livecrawlFormats: LiveCrawlFormats.Markdown,
  });

  result.results?.web?.forEach((r) => {
    console.log(r.title);
    console.log(`  URL: ${r.url}`);
    if (r.contents?.markdown) {
      console.log(`  Content (${r.contents.markdown.length} chars)`);
      console.log(`  Preview: ${r.contents.markdown.slice(0, 200)}...\n`);
    } else {
      console.log("  (No content retrieved)\n");
    }
  });
}

run();
```
```curl
curl -G https://ydc-index.io/v1/search \
-H "X-API-Key: api_key" \
--data-urlencode "query=transformer architecture explained" \
-d count=5 \
-d livecrawl=web \
-d livecrawl_formats=markdown
```
## Crawl news results
Set `livecrawl=news` to get full article bodies for news results. Combine with `freshness` for breaking news pipelines.
```python
from youdotcom import You
from youdotcom.models import Freshness, LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    res = you.search.unified(
        query="semiconductor supply chain",
        freshness=Freshness.WEEK,
        count=5,
        livecrawl=LiveCrawl.NEWS,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
    )

    if res.results and res.results.news:
        for article in res.results.news:
            print(f"{article.title}")
            if article.contents:
                print(article.contents.markdown[:400])
            print()
```
```typescript
import { You } from "@youdotcom-oss/sdk";
import { Freshness, LiveCrawl, LiveCrawlFormats } from "@youdotcom-oss/sdk/models";
const you = new You({
  apiKeyAuth: "api_key",
});

async function run() {
  const result = await you.search({
    query: "semiconductor supply chain",
    freshness: Freshness.Week,
    count: 5,
    livecrawl: LiveCrawl.News,
    livecrawlFormats: LiveCrawlFormats.Markdown,
  });

  result.results?.news?.forEach((article) => {
    console.log(article.title);
    if (article.contents?.markdown) {
      console.log(article.contents.markdown.slice(0, 400));
    }
    console.log();
  });
}

run();
```
```curl
curl -G https://ydc-index.io/v1/search \
-H "X-API-Key: api_key" \
--data-urlencode "query=semiconductor supply chain" \
-d freshness=week \
-d count=5 \
-d livecrawl=news \
-d livecrawl_formats=markdown
```
## Crawl both web and news
Use `livecrawl=all` to crawl every result type in one request.
```python
from youdotcom import You
from youdotcom.models import LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    res = you.search.unified(
        query="quantum computing breakthroughs",
        count=5,
        livecrawl=LiveCrawl.ALL,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
    )

    if res.results:
        for result in (res.results.web or []):
            if result.contents:
                print(f"[WEB] {result.title}: {len(result.contents.markdown)} chars")
        for result in (res.results.news or []):
            if result.contents:
                print(f"[NEWS] {result.title}: {len(result.contents.markdown)} chars")
```
```typescript
import { You } from "@youdotcom-oss/sdk";
import { LiveCrawl, LiveCrawlFormats } from "@youdotcom-oss/sdk/models";
const you = new You({
  apiKeyAuth: "api_key",
});

async function run() {
  const result = await you.search({
    query: "quantum computing breakthroughs",
    count: 5,
    livecrawl: LiveCrawl.All,
    livecrawlFormats: LiveCrawlFormats.Markdown,
  });

  result.results?.web?.forEach((r) => {
    if (r.contents?.markdown) {
      console.log(`[WEB] ${r.title}: ${r.contents.markdown.length} chars`);
    }
  });

  result.results?.news?.forEach((r) => {
    if (r.contents?.markdown) {
      console.log(`[NEWS] ${r.title}: ${r.contents.markdown.length} chars`);
    }
  });
}

run();
```
```curl
curl -G https://ydc-index.io/v1/search \
-H "X-API-Key: api_key" \
--data-urlencode "query=quantum computing breakthroughs" \
-d count=5 \
-d livecrawl=all \
-d livecrawl_formats=markdown
```
## Control crawl timeout
By default, the crawler waits up to 10 seconds per page. For latency-sensitive applications, reduce `crawl_timeout`; for complex or slow-loading pages, increase it (up to 60 seconds). Results that cannot be crawled within the timeout are returned without a `contents` object.
```python
from youdotcom import You
from youdotcom.models import LiveCrawl, LiveCrawlFormats

with You(api_key_auth="api_key") as you:
    # Low-latency pipeline: only wait 3 seconds per page
    res = you.search.unified(
        query="latest Python releases",
        count=5,
        livecrawl=LiveCrawl.WEB,
        livecrawl_formats=LiveCrawlFormats.MARKDOWN,
        crawl_timeout=3,
    )

    if res.results and res.results.web:
        for result in res.results.web:
            status = "crawled" if result.contents else "skipped (timeout)"
            print(f"{result.title} — {status}")
```
```typescript
import { You } from "@youdotcom-oss/sdk";
import { LiveCrawl, LiveCrawlFormats } from "@youdotcom-oss/sdk/models";
const you = new You({
  apiKeyAuth: "api_key",
});

async function run() {
  // Low-latency pipeline: only wait 3 seconds per page
  const result = await you.search({
    query: "latest Python releases",
    count: 5,
    livecrawl: LiveCrawl.Web,
    livecrawlFormats: LiveCrawlFormats.Markdown,
    crawlTimeout: 3,
  });

  result.results?.web?.forEach((r) => {
    const status = r.contents ? "crawled" : "skipped (timeout)";
    console.log(`${r.title} — ${status}`);
  });
}

run();
```
```curl
curl -G https://ydc-index.io/v1/search \
-H "X-API-Key: api_key" \
--data-urlencode "query=latest Python releases" \
-d count=5 \
-d livecrawl=web \
-d livecrawl_formats=markdown \
-d crawl_timeout=3
```
## HTML vs Markdown
| Format | Best for |
| ---------- | ----------------------------------------------------------- |
| `markdown` | LLM prompts, RAG, text analysis — clean, no boilerplate |
| `html` | Rendering, scraping structured data, preserving page layout |
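If you request `html`, you typically post-process the page yourself. A sketch using Python's standard-library `html.parser` to collect link targets from a returned `contents.html` string; the sample markup in the test is illustrative:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href attributes from anchor tags in a crawled HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```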
## Already have URLs?
If you have a list of URLs and don't need to search first, use the [Contents API](/contents/overview) directly. It accepts URLs without a query and returns the same `markdown` or `html` content.
***
## Next steps
* Fetch page content directly from URLs with the [Contents API](/contents/overview), no search query needed
* Retrieve and filter real-time news results
* View all parameters and response schemas
* Back to the Search API overview