# How to Evaluate You.com Search API
> A practical guide to benchmarking You.com's Search API: methodology, configurations, datasets, and real performance tradeoffs.
## Why This Guide Exists
Most developer docs treat evaluation like checking boxes. This guide treats it like shipping production code: you need real benchmarks, honest tradeoffs, and configurations that actually work.
We'll cover:
* **Retrieval Quality** - Does it actually find what you need?
* **Latency** - Fast enough for your users?
* **Freshness** - Can it handle "what happened today?"
* **Cost** - What's your burn rate per query?
* **Agent Performance** - Does it work in multi-step reasoning workflows?
***
**Want help running your eval?** Our team can design and run custom benchmarks for your use case. [Talk to us](https://you.com/evaluations-as-a-service)
## The Golden Rule: Start Simple, Stay Fair
**TL;DR: Use default settings. Don't over-engineer your first eval.**
Most failed evaluations have one thing in common: people add too many parameters too early.
### Recommended Starting Point
```python
from youdotcom import You
with You("your_api_key") as you:
    result = you.search.unified(
        query=query,
        count=10
    )
    # That's it. No date filters. No domain whitelists. Just search.
```
### When to Add Complexity
Add parameters ONLY when:
1. Your evaluation explicitly tests that feature (e.g., freshness requires the `freshness` parameter)
2. You've already run baseline evals and know what you're optimizing for
3. The parameter reflects actual production usage, not hypothetical edge cases
**Anti-pattern**: "Let me add every possible parameter to make this perfect"
**Better approach**: "Let me run this with defaults, measure performance, then iterate"
***
## API Parameters Reference
The Search API (`GET https://ydc-index.io/v1/search`) accepts these parameters:
| Parameter | Type | Purpose | When to Use |
| ------------------- | --------------------- | ------------------------------------------------------------------------ | ------------------------------ |
| `query` | string (required) | The search query | Always |
| `count` | integer (default: 10) | Max results per section (web/news) | Fix at 10 for fair comparisons |
| `freshness` | string | Time filter: `day`, `week`, `month`, `year`, or `YYYY-MM-DDtoYYYY-MM-DD` | Time-sensitive queries only |
| `livecrawl` | string | Get full page content: `web`, `news`, or `all` | Essential for RAG/synthesis |
| `livecrawl_formats` | string | Content format: `html` or `markdown` | When using livecrawl |
| `country` | string | Geographic focus (e.g., `US`, `GB`) | Location-specific queries |
| `language` | string | Result language in BCP 47 (default: `EN`) | Non-English queries |
| `safesearch` | string | Content filter: `off`, `moderate`, `strict` | Production (default: moderate) |
| `offset` | integer | Pagination (multiples of count) | Fetching more results |
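
For example, here is a minimal pagination sketch. It assumes the Python SDK forwards `offset` to the endpoint unchanged and reads the table's "multiples of count" as a page index (so `offset=1` skips the first `count` results); adjust if your results suggest otherwise.
```python
from youdotcom import You

# Hypothetical pagination sketch: page through a query in blocks of 10.
with You("your_api_key") as you:
    first_page = you.search.unified(query="climate tech startups", count=10, offset=0)
    second_page = you.search.unified(query="climate tech startups", count=10, offset=1)  # next block of 10
```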
***
## Latency: Compare Apples to Apples
**Critical insight: Never compare APIs with wildly different latency profiles.**
A 200ms API and a 3000ms API serve different use cases. Comparing them is like comparing a bicycle to a freight train.
### Latency Buckets
| Latency Class | Use Cases | Compare Within |
| ------------------------- | ----------------------------------------------- | ---------------------------------- |
| **Ultra-fast (\< 200ms)** | Autocomplete, real-time voice agents | Other sub-200ms systems |
| **Fast (200-800ms)** | Chatbots, user-facing QA | Similar mid-latency APIs |
| **Deep (>1000ms)** | Research, multi-hop reasoning, batch processing | Other comprehensive search systems |
### Fair Comparison Framework
```python
# Good: Comparing within the same latency class
compare_systems([
    "You.com (350ms p50)",
    "Competitor A (380ms p50)",
    "Competitor B (290ms p50)"
])

# Bad: Comparing across latency classes
compare_systems([
    "You.com (350ms p50)",
    "Deep Research API (2800ms p50)"  # Meaningless comparison
])
```
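Before bucketing a system, measure its latency distribution yourself rather than trusting a single run. A minimal sketch (the query list and repeat count are arbitrary; swap in queries from your own logs):
```python
import time
from youdotcom import You

def latency_profile(queries, runs_per_query=3):
    """Collect wall-clock latencies for repeated searches and report p50/p95 in ms."""
    samples = []
    with You("your_api_key") as you:
        for query in queries:
            for _ in range(runs_per_query):
                start = time.time()
                you.search.unified(query=query, count=10)
                samples.append((time.time() - start) * 1000)
    samples.sort()
    return {
        "p50_ms": round(samples[len(samples) // 2]),
        "p95_ms": round(samples[int(len(samples) * 0.95)]),
        "n": len(samples),
    }
```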
***
## Configuration Examples
### Minimal Config (Start Here)
```python
from youdotcom import You
with You("your_api_key") as you:
    result = you.search.unified(
        query=query,
        count=10
    )
```
### With Full Page Content (for RAG)
```python
result = you.search.unified(
    query=query,
    count=10,
    livecrawl="web",
    livecrawl_formats="markdown"
)
# Access full content via result.results.web[0].contents.markdown
```
### Freshness Config (Time-Sensitive Queries)
```python
result = you.search.unified(
    query="latest AI news",
    count=10,
    freshness="day"  # or "week", "month", "year", "2024-01-01to2024-12-31"
)
```
### Raw HTTP Request
```bash
curl -X GET "https://ydc-index.io/v1/search?query=climate+tech+startups&count=10&livecrawl=web&livecrawl_formats=markdown" \
-H "X-API-Key: your_api_key"
```
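If you'd rather skip the SDK, the same request works with plain `requests`; this sketch assumes nothing beyond the endpoint and `X-API-Key` header shown above:
```python
import requests

response = requests.get(
    "https://ydc-index.io/v1/search",
    headers={"X-API-Key": "your_api_key"},
    params={
        "query": "climate tech startups",
        "count": 10,
        "livecrawl": "web",
        "livecrawl_formats": "markdown",
    },
    timeout=30,
)
response.raise_for_status()
data = response.json()
```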
***
## Evaluation Workflow: 4 Steps That Actually Work
### 1. Define What You're Testing
Don't start with "let's evaluate everything." Start with:
* What capability matters? (speed? accuracy? freshness?)
* What latency can you tolerate?
* Single-step retrieval or multi-step reasoning?
**Example scope**: "We need 90%+ accuracy on customer support questions with \< 500ms latency"
### 2. Pick Your Dataset
| Dataset | Tests | Notes |
| ------------------ | ------------------------ | ------------------------------------------------------ |
| SimpleQA | Fast factual QA | Good baseline |
| FRAMES | Multi-step reasoning | Agentic workflows |
| FreshQA | Time-sensitive queries | Use with `freshness` param |
| Custom (your data) | Domain-specific accuracy | Start [here](https://you.com/evaluations-as-a-service) |
**Pro tip**: Start with public benchmarks, but your production queries are the real test.
**Need help building a custom dataset?** [We can help](https://you.com/evaluations-as-a-service)
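
Whatever dataset you choose, it helps to normalize records into a flat question/answer shape so the eval script in step 3 stays dataset-agnostic. The examples below assume one JSON object per line, for instance:
```json
{"question": "In what year did Apollo 11 land on the Moon?", "answer": "1969"}
```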
### 3. Run Your Eval
```python
from youdotcom import You
import time
def run_eval(dataset_path, config):
    results = []
    with You("your_api_key") as you:
        for item in load_dataset(dataset_path):
            query = item['question']
            expected = item['answer']

            # Step 1: Retrieve
            start = time.time()
            search_results = you.search.unified(
                query=query,
                count=config.get('count', 10),
                livecrawl=config.get('livecrawl'),
                freshness=config.get('freshness')
            )
            latency = (time.time() - start) * 1000

            # Step 2: Synthesize answer using your LLM
            snippets = [r.snippets[0] for r in search_results.results.web if r.snippets]
            context = "\n".join(snippets)
            answer = llm.generate(
                f"Answer using only this context:\n{context}\n\n"
                f"Question: {query}\nAnswer:"
            )

            # Step 3: Grade
            grade = evaluate_answer(expected, answer)
            results.append({
                'correct': grade == 'correct',
                'latency_ms': latency
            })

    # Calculate metrics
    accuracy = sum(r['correct'] for r in results) / len(results)
    p50_latency = sorted([r['latency_ms'] for r in results])[len(results) // 2]
    return {
        'accuracy': f"{accuracy:.1%}",
        'p50_latency': f"{p50_latency:.0f}ms"
    }
```
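The script above leans on three pieces that are not part of the SDK: `load_dataset`, `llm.generate`, and `evaluate_answer`. They are placeholders for your own code; a minimal sketch, with exact-match grading standing in for whatever grader (LLM judge, human review) you actually use:
```python
import json

def load_dataset(path):
    """Yield {'question': ..., 'answer': ...} records from a JSONL file."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def evaluate_answer(expected, answer):
    """Placeholder grader: substring match. Swap in an LLM judge for free-form answers."""
    return "correct" if expected.strip().lower() in answer.strip().lower() else "incorrect"

# `llm` is whatever generation client you already use; the eval script only
# needs a generate(prompt) -> str method on it.
```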
### 4. Analyze & Iterate
Look at:
* **Accuracy vs latency tradeoff** - Can you get 95% accuracy at 300ms?
* **Failure modes** - Which queries fail? Is there a pattern?
* **Cost** - What's your \$/1000 queries? (a rough cost sketch follows this list)
Then iterate:
* Add `livecrawl` if snippets aren't giving enough context
* Add `freshness` if failures are due to stale content
* Compare against competitors in the same latency class
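
The cost column is just arithmetic; here is a back-of-the-envelope sketch with placeholder prices (none of these are published rates, plug in your actual contract and model pricing):
```python
def cost_per_1k_queries(search_price_per_call, avg_llm_input_tokens=0,
                        avg_llm_output_tokens=0, llm_price_in_per_1k=0.0,
                        llm_price_out_per_1k=0.0):
    """Rough $/1,000-query estimate: search calls plus synthesis tokens.
    Every price here is a placeholder you supply."""
    llm_cost_per_query = (avg_llm_input_tokens / 1000) * llm_price_in_per_1k \
        + (avg_llm_output_tokens / 1000) * llm_price_out_per_1k
    return 1000 * (search_price_per_call + llm_cost_per_query)
```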
***
## Response Structure
The API returns results in two sections:
```json
{
  "results": {
    "web": [
      {
        "url": "https://example.com/article",
        "title": "Article Title",
        "description": "Brief description",
        "snippets": ["Relevant text from the page..."],
        "page_age": "2025-01-20T12:00:00Z",
        "contents": {
          "markdown": "Full page content if livecrawl enabled"
        }
      }
    ],
    "news": [
      {
        "title": "News Headline",
        "description": "News summary",
        "url": "https://news.example.com/story",
        "page_age": "2025-01-25T08:00:00Z"
      }
    ]
  },
  "metadata": {
    "search_uuid": "uuid",
    "query": "your query",
    "latency": 0.234
  }
}
```
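Through the Python SDK the same fields come back as attributes (as in the earlier `result.results.web[0].contents.markdown` example). A short sketch for walking a response; the field names mirror the JSON above:
```python
def summarize_response(result):
    """Print one line per hit plus the request metadata."""
    for hit in result.results.web:
        print("web  |", hit.title, "|", hit.url, "|", getattr(hit, "page_age", None))
    for hit in result.results.news:
        print("news |", hit.title, "|", hit.url, "|", getattr(hit, "page_age", None))
    print("search_uuid:", result.metadata.search_uuid,
          "| latency:", result.metadata.latency)
```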
***
## Tool Calling for Agents
When evaluating You.com in agentic workflows, keep the tool definition minimal.
**Open-source evaluation framework**: Check out [Agentic Web Search Playoffs](https://github.com/youdotcom-oss/agentic-web-search-playoffs) for a ready-to-use benchmark comparing web search providers in agentic contexts.
```python
search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web using You.com. Returns relevant snippets and URLs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                }
            },
            "required": ["query"]
        }
    }
}
```
**Note**: Don't expose `freshness`, `livecrawl`, or other parameters to the agent unless necessary. Let the agent focus on formulating good queries.
### Implementation
```python
import json

from youdotcom import You

def handle_tool_call(tool_call):
    query = tool_call.arguments["query"]
    with You("your_api_key") as you:
        results = you.search.unified(query=query, count=10)

        # Format for agent consumption
        formatted = []
        for r in results.results.web[:5]:
            formatted.append({
                "title": r.title,
                "snippet": r.snippets[0] if r.snippets else r.description,
                "url": r.url
            })
        return json.dumps(formatted)
```
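How `handle_tool_call` plugs into an agent loop depends on your LLM client, so the sketch below keeps that client abstract. `call_model(messages, tools)` is an assumed wrapper around whatever API you use; it is expected to return the model's text plus any tool calls in the same simplified shape `handle_tool_call` expects:
```python
def agent_turn(call_model, messages):
    """One agent turn: answer directly, or call web_search and answer with the tool result."""
    text, tool_calls = call_model(messages, tools=[search_tool])
    if not tool_calls:
        return text
    for tool_call in tool_calls:
        messages.append({
            "role": "tool",
            "name": "web_search",
            "content": handle_tool_call(tool_call),
        })
    final_text, _ = call_model(messages, tools=[search_tool])
    return final_text
```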
***
## Common Mistakes to Avoid
### 1. Over-Filtering Too Early
**Don't:**
```python
result = you.search.unified(
    query=query,
    freshness="week",
    country="US",
    language="EN",
    safesearch="strict"
)
```
**Do:**
```python
result = you.search.unified(query=query, count=10) # Start simple
```
### 2. Ignoring Your Actual Queries
**Don't just run:** Public benchmarks
**Also run:** Your actual user queries from production logs
### 3. Not Measuring What Users Care About
**Don't only measure:** Technical accuracy
**Also measure:** Click-through rate, task completion, reformulation rate
### 4. Testing in Isolation
**Don't test:** Search API alone
**Test:** Full workflow (search -> synthesis -> grading) with your actual LLM and prompts
***
## Debugging Performance Issues
### If Accuracy is Low (\< 85%)
1. Are you requesting enough results? Try `count=15`
2. Enable livecrawl for full page content:
```python
results = you.search.unified(
    query=query,
    count=15,
    livecrawl="web",
    livecrawl_formats="markdown"
)
```
3. Is your synthesis prompt good? Test with GPT-4
4. Is your grading fair? Manually review a sample (a sampling sketch follows below)
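
Sampling for that manual review is cheap to automate. A sketch, assuming you extend the per-item records in `run_eval` to also carry `query`, `expected`, and `answer`:
```python
import random

def sample_for_review(records, n=20, seed=0):
    """Return a mix of failing and passing items to read by hand."""
    rng = random.Random(seed)
    failures = [r for r in records if not r["correct"]]
    passes = [r for r in records if r["correct"]]
    return (rng.sample(failures, min(n // 2, len(failures)))
            + rng.sample(passes, min(n // 2, len(passes))))
```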
### If Results are Stale
```python
# Force fresh results
results = you.search.unified(
    query=query,
    count=10,
    freshness="day"  # or "week", "month"
)
```
***
**Still stuck?** Our team has run hundreds of search evals. [Get hands-on help](https://you.com/evaluations-as-a-service)
## Production Checklist
### 1. Run Comparative Benchmarks
```python
configs = [
    {'count': 5},
    {'count': 10},
    {'count': 10, 'livecrawl': 'web', 'livecrawl_formats': 'markdown'}
]

for config in configs:
    results = run_eval('your_dataset.json', config)
    print(f"{config}: {results}")
```
### 2. Set Up Monitoring
```python
# What to log
{
    'query': query,
    'latency_ms': latency,
    'num_results_returned': len(results.results.web),
    'used_livecrawl': bool(livecrawl),
    'freshness_filter': freshness,
    'timestamp': now()
}
```
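A thin wrapper that actually captures those fields around each call might look like the sketch below; the `print` stands in for whatever logging or metrics pipeline you run:
```python
import time
from datetime import datetime, timezone

def monitored_search(you, query, count=10, livecrawl=None, freshness=None):
    """Run one search and emit the log record described above."""
    start = time.time()
    result = you.search.unified(query=query, count=count,
                                livecrawl=livecrawl, freshness=freshness)
    record = {
        "query": query,
        "latency_ms": round((time.time() - start) * 1000),
        "num_results_returned": len(result.results.web),
        "used_livecrawl": bool(livecrawl),
        "freshness_filter": freshness,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    print(record)  # replace with your logging stack
    return result
```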
### 3. Document Everything
```markdown
## Search Evaluation - 2025-01-26
**Dataset**: 500 customer support queries
**Config**: count=10, livecrawl=web
**Results**:
- Accuracy: 91.2%
- P50 Latency: 445ms
- P95 Latency: 892ms
**Decision**: Ship with livecrawl enabled - improves synthesis quality
```
***
## Getting Help
* **[Evaluations as a Service](https://you.com/evaluations-as-a-service)** - Custom benchmarks designed and run by our team
* **[Agentic Web Search Playoffs](https://github.com/youdotcom-oss/agentic-web-search-playoffs)** - Open-source benchmark for comparing web search in agentic workflows
* [API Documentation](https://docs.you.com/)
* [Discord Community](https://discord.gg/you)
* Email: [developers@you.com](mailto:developers@you.com)
***
**Remember**: The best evaluation is the one you actually run. Start simple, measure what matters, and iterate.