How to Evaluate You.com Search API
A practical guide to benchmarking You.com’s Search API: methodology, configurations, datasets, and real performance tradeoffs.
Why This Guide Exists
Most developer docs treat evaluation like checking boxes. This guide treats it like shipping production code: you need real benchmarks, honest tradeoffs, and configurations that actually work.
We’ll cover:
- Retrieval Quality - Does it actually find what you need?
- Latency - Fast enough for your users?
- Freshness - Can it handle “what happened today?”
- Cost - What’s your burn rate per query?
- Agent Performance - Does it work in multi-step reasoning workflows?
Want help running your eval? Our team can design and run custom benchmarks for your use case. Talk to us
The Golden Rule: Start Simple, Stay Fair
TL;DR: Use default settings. Don’t over-engineer your first eval.
Most failed evaluations have one thing in common: people add too many parameters too early.
Recommended Starting Point
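A minimal sketch of that starting point in Python, assuming the requests library, an API key passed in an X-API-Key header, and a query parameter; confirm the header and parameter names against the API Documentation before relying on them:

```python
import os
import requests

# Baseline eval call: the query only, every other parameter left at its default.
resp = requests.get(
    "https://ydc-index.io/v1/search",
    headers={"X-API-Key": os.environ["YDC_API_KEY"]},  # assumed header name
    params={"query": "how to configure SSO with Okta"},
    timeout=10,
)
resp.raise_for_status()
results = resp.json()
```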
When to Add Complexity
Add parameters ONLY when:
- Your evaluation explicitly tests that feature (e.g., freshness requires the freshness parameter)
- You’ve already run baseline evals and know what you’re optimizing for
- The parameter reflects actual production usage, not hypothetical edge cases
Anti-pattern: “Let me add every possible parameter to make this perfect”
Better approach: “Let me run this with defaults, measure performance, then iterate”
API Parameters Reference
The Search API (GET https://ydc-index.io/v1/search) accepts a small set of query-string parameters. The ones used throughout this guide are query (the search string), count (number of results to return), freshness (restrict results by recency), and livecrawl (fetch full page content); see the API Documentation for the complete list.
Latency: Compare Apples to Apples
Critical insight: Never compare APIs with wildly different latency profiles.
A 200ms API and a 3000ms API serve different use cases. Comparing them is like comparing a bicycle to a freight train.
Latency Buckets
Fair Comparison Framework
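One practical way to stay fair is to measure each provider’s latency distribution on the same query set and only compare providers that land in the same latency bucket. A sketch of the measurement side, reusing the defaults-only request above (header and parameter names are the same assumptions):

```python
import os
import statistics
import time

import requests

QUERIES = ["query one", "query two"]  # replace with your benchmark queries

def measure_latencies(queries):
    """Return per-query wall-clock latency (in ms) for the Search API."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        requests.get(
            "https://ydc-index.io/v1/search",
            headers={"X-API-Key": os.environ["YDC_API_KEY"]},  # assumed header name
            params={"query": q},
            timeout=30,
        )
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

lat = measure_latencies(QUERIES)
p50 = statistics.median(lat)
p95 = statistics.quantiles(lat, n=20)[-1]  # 95th percentile
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms")
```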
Configuration Examples
Minimal Config (Start Here)
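A sketch of the minimal parameter set; pass it to the same request shown under Recommended Starting Point:

```python
# Minimal config: just the query; every other parameter stays at its default.
minimal_params = {
    "query": "reset two-factor authentication for a locked account",
}
```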
With Full Page Content (for RAG)
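A sketch with livecrawl enabled so results carry full page content for RAG instead of snippets only; the livecrawl value below is a placeholder, so check the API Documentation for the accepted values:

```python
# Full page content for RAG: enable livecrawl so results include page text,
# not just snippets.
rag_params = {
    "query": "terraform remote state locking best practices",
    "count": 10,
    "livecrawl": "...",  # placeholder value; see the API Documentation for accepted values
}
```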
Freshness Config (Time-Sensitive Queries)
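A sketch for time-sensitive queries; the freshness value below is a placeholder, so check the API Documentation for the accepted recency windows:

```python
# Freshness config: restrict results to recent content for time-sensitive queries.
fresh_params = {
    "query": "who won the champions league final",
    "count": 10,
    "freshness": "...",  # placeholder value, e.g. a recency window; see the API Documentation
}
```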
Raw HTTP Request
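The same minimal request expressed closer to the wire, using Python’s standard http.client; the request line and headers it sends are shown in the leading comments, and the header and parameter names are the same assumptions as above:

```python
# GET /v1/search?query=... HTTP/1.1
# Host: ydc-index.io
# X-API-Key: <your key>
import http.client
import json
import os
import urllib.parse

conn = http.client.HTTPSConnection("ydc-index.io")
conn.request(
    "GET",
    "/v1/search?" + urllib.parse.urlencode({"query": "site reliability error budgets"}),
    headers={"X-API-Key": os.environ["YDC_API_KEY"]},
)
body = json.loads(conn.getresponse().read())
```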
Evaluation Workflow: 4 Steps That Actually Work
1. Define What You’re Testing
Don’t start with “let’s evaluate everything.” Start with:
- What capability matters? (speed? accuracy? freshness?)
- What latency can you tolerate?
- Single-step retrieval or multi-step reasoning?
Example scope: “We need 90%+ accuracy on customer support questions with < 500ms latency”
2. Pick Your Dataset
Pro tip: Start with public benchmarks, but your production queries are the real test.
Need help building a custom dataset? We can help
3. Run Your Eval
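A minimal eval-loop sketch: send every dataset query, record latency, status, and the raw results, and leave grading to your own synthesis and scoring step. The dataset shape, header name, and output layout are assumptions, not a prescribed format:

```python
import os
import time

import requests

def run_eval(dataset, count=10):
    """dataset: list of {"query": ..., "expected": ...} dicts (your format may differ)."""
    rows = []
    for item in dataset:
        start = time.perf_counter()
        resp = requests.get(
            "https://ydc-index.io/v1/search",
            headers={"X-API-Key": os.environ["YDC_API_KEY"]},  # assumed header name
            params={"query": item["query"], "count": count},
            timeout=30,
        )
        rows.append({
            "query": item["query"],
            "expected": item.get("expected"),
            "latency_ms": (time.perf_counter() - start) * 1000,
            "status": resp.status_code,
            "results": resp.json() if resp.ok else None,
        })
    return rows

# rows = run_eval(my_dataset)  # persist rows so grading and analysis can be re-run offline
```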
4. Analyze & Iterate
Look at:
- Accuracy vs latency tradeoff - Can you get 95% accuracy at 300ms?
- Failure modes - Which queries fail? Is there a pattern?
- Cost - What’s your $/1000 queries?
Then iterate:
- Add livecrawl if snippets aren’t giving enough context
- Add freshness if failures are due to stale content
- Compare against competitors in the same latency class
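A small sketch of the analysis over the rows collected in step 3; the correct field is whatever your grader produces, and the per-query price is a stand-in for your actual plan pricing:

```python
import statistics

def summarize(rows, price_per_query_usd=0.0):  # plug in your actual plan pricing
    latencies = [r["latency_ms"] for r in rows]
    graded = [r for r in rows if r.get("correct") is not None]  # filled in by your grader
    return {
        "accuracy": sum(r["correct"] for r in graded) / len(graded) if graded else None,
        "p50_ms": statistics.median(latencies),
        "p95_ms": statistics.quantiles(latencies, n=20)[-1],
        "usd_per_1000_queries": price_per_query_usd * 1000,
    }
```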
Response Structure
The API returns results in two sections; see the API Documentation for the exact response schema.
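Rather than hard-coding field names here, a quick way to see the sections is to inspect a live response directly (same header and parameter assumptions as above):

```python
import os
import requests

resp = requests.get(
    "https://ydc-index.io/v1/search",
    headers={"X-API-Key": os.environ["YDC_API_KEY"]},
    params={"query": "example query"},
    timeout=10,
)
data = resp.json()
# Print the top-level sections so you can see exactly what the API returns.
print(list(data.keys()))
```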
Tool Calling for Agents
When evaluating You.com in agentic workflows, keep the tool definition minimal.
Open-source evaluation framework: Check out Agentic Web Search Playoffs for a ready-to-use benchmark comparing web search providers in agentic contexts.
Note: Don’t expose freshness, livecrawl, or other parameters to the agent unless necessary. Let the agent focus on formulating good queries.
Implementation
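A minimal sketch of such a tool definition in OpenAI-style function-calling format, written as a Python dict; per the note above, only the query is exposed to the agent, and the exact schema depends on your agent framework:

```python
# Minimal web-search tool definition: the agent only chooses the query string;
# count, freshness, livecrawl, etc. stay fixed on the application side.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return relevant results for a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."},
            },
            "required": ["query"],
        },
    },
}
```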
Common Mistakes to Avoid
1. Over-Filtering Too Early
Don’t:
Do:
2. Ignoring Your Actual Queries
Don’t just run: Public benchmarks
Also run: Your actual user queries from production logs
3. Not Measuring What Users Care About
Don’t only measure: Technical accuracy
Also measure: Click-through rate, task completion, reformulation rate
4. Testing in Isolation
Don’t test: Search API alone
Test: Full workflow (search -> synthesis -> grading) with your actual LLM and prompts
Debugging Performance Issues
If Accuracy is Low (< 85%)
- Are you requesting enough results? Try count=15
- Enable livecrawl for full page content
- Is your synthesis prompt good? Test with GPT-4
- Is your grading fair? Manually review a sample
If Results are Stale
- Add the freshness parameter to restrict results to a recent time window
Still stuck? Our team has run hundreds of search evals. Get hands-on help
Production Checklist
1. Run Comparative Benchmarks
2. Set Up Monitoring
3. Document Everything
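For item 2 (monitoring), a minimal sketch: wrap the search call so every request logs latency and status code, which is enough to alert on latency regressions and error-rate spikes. Logger configuration and alert thresholds are left to your stack, and the header name is the same assumption as above:

```python
import logging
import os
import time

import requests

log = logging.getLogger("you_search")

def monitored_search(query, **params):
    """Call the Search API and log latency and status for each request."""
    start = time.perf_counter()
    resp = requests.get(
        "https://ydc-index.io/v1/search",
        headers={"X-API-Key": os.environ["YDC_API_KEY"]},  # assumed header name
        params={"query": query, **params},
        timeout=30,
    )
    log.info(
        "you_search query=%r status=%s latency_ms=%.0f",
        query, resp.status_code, (time.perf_counter() - start) * 1000,
    )
    resp.raise_for_status()
    return resp.json()
```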
Getting Help
- Evaluations as a Service - Custom benchmarks designed and run by our team
- Agentic Web Search Playoffs - Open-source benchmark for comparing web search in agentic workflows
- API Documentation
- Discord Community
- Email: developers@you.com
Remember: The best evaluation is the one you actually run. Start simple, measure what matters, and iterate.